Evaluation of Gene-Finding Programs on Mammalian Sequences

Size: px
Start display at page:

Download "Evaluation of Gene-Finding Programs on Mammalian Sequences"

Transcription

1 Letter Evaluation of Gene-Finding Programs on Mammalian Sequences Sanja Rogic, 1 Alan K. Mackworth, 2 and Francis B.F. Ouellette 3 1 Computer Science Department, The University of California at Santa Cruz, Santa Cruz 95064, California; 2 Computer Science Department, The University of British Columbia, Vancouver, BC V6T 1Z4, Canada; 3 Centre for Molecular Medicine and Therapeutics, Vancouver, BC V5Z 4H4, Canada We present an independent comparative analysis of seven recently developed gene-finding programs: FGENES, GeneMark.hmm, Genie, Genscan, HMMgene, Morgan, and MZEF. For evaluation purposes we developed a new, thoroughly filtered, and biologically validated dataset of mammalian genomic sequences that does not overlap with the training sets of the programs analyzed. Our analysis shows that the new generation of programs has substantially better results than the programs analyzed in previous studies. The accuracy of the programs was also examined as a function of various sequence and prediction features, suchas G + C content of the sequence, length and type of exons, signal type, and score of the exon prediction. This approach pinpoints the strengths and weaknesses of each individual program as well as those of computational gene-finding in general. The dataset used in this analysis (HMR195) as well as the tables with the complete results are available at rogic/evaluation/. Currently, in genome centers around the world, millions of bases of genomic DNA from different organisms are sequenced every day. With the recently assembled draft sequence of the human genome in hand and the completed sequence to follow in a couple of years, we need to re-evaluate our methods for deciphering such an enormous amount of data. We present here the results of a comprehensive evaluation of recent computer programs used for the identification of protein coding genes in eukaryotic genomic sequences. Because we expect that such an analysis will be of interest to both biologists and computer scientists, we will first provide an overview of gene structure and computational methods for genefinding. Gene Structure The genes of most eukaryotic organisms are neither continuous nor contiguous. They are separated by long stretches of intergenic DNA and their coding sequences are interrupted by noncoding introns. Coding sequences occupy just a small fraction of a typical higher eukaryotic genome; the extreme example is the human genome, where an estimate of that fraction at 3% (Duret et al. 1995) was recently confirmed for chromosome 22 (Dunham et al. 1999). To obtain a continuous coding sequence which will be translated into a protein sequence, genes are transcribed into long premrna molecules that subsequently undergo complex 5 Corresponding author. rogic@cse.ucsc.edu; FAX: (831) Article and publication are at /cgi/doi/ / gr processing to remove intronic sequences and assemble exons to form mrna. However, assembly of the gene exons in the mature mrna is not always the same; Mironov et al. (1999) found that at least 35% of human genes are alternatively spliced having more than one possible exon assembly. The arrangement of genes in genomes is also prone to exceptions. Although usually separated with an intergenic region, there are examples of genes nested within each other (Dunham et al. 1999); that is, one gene located in an intron of another gene or overlapping genes on the same (Schulz and Butler 1989; Ashburner et al. 1999) or opposite (Cooper et al. 1998) DNA strands. The presence of pseudogenes (nonfunctional sequences resembling real genes) which are distributed in numerous copies throughout the genome further complicates the identification of true protein coding genes. Regulatory regions play a crucial role in gene expression, and their identification is needed to fully comprehend a gene s function, activity, and role in cellular processes. The location of regulatory regions relative to their target gene is not uniquely determined; the basic regulatory elements, such as the TATA and CAT boxes, are usually found in the upstream proximity of the transcription start site, while the other elements such as enhancers and silencers, can be located in distant upstream and downstream regions of a gene and sometimes even within the introns of the gene. This brief overview of genome organization and gene architecture highlights the complexity of gene identification in the sequences of uncharacterized DNA. 11: by Cold Spring Harbor Laboratory Press ISSN /01 $5.00; Genome Research 817

2 Rogic et al. Computational Methods for Identification of Genes There are several methods used for the experimental discovery of genes, but they are time-consuming and costly. Accordingly, for the last 15 years researchers have been developing computational methods for gene-finding that can automate, or facilitate, the identification of genes. Two basic approaches have been established for computational gene-finding: the sequence similarity search, or lookup (Fickett 1996), method and the integrated compositional and signal search, or template (Fickett 1996), method. The latter method is also commonly referred to as ab initio gene finding. Sequence similarity search is a well-established computational method for gene discovery which has been used extensively with considerable success. It is based on sequence conservation due to the functional constraints and is used to search for regions of similarity between an uncharacterized sequence of interest and already characterized sequences in a public sequence database. Significant similarity between two sequences suggests that they are homologous, that is, they have common evolutionary origin. A query sequence can be compared with DNA, protein, or expressed sequence tag (EST) sequences or it can be searched for known sequence motifs. If a query sequence is found to be significantly similar to an already annotated sequence (DNA or protein), we can use the information from the annotated sequence to possibly infer gene structure or function of the query sequence. Comparison with an EST database can provide information if the sequence of interest is transcribed, that is, contains an expressed gene, but will only give incomplete clues about the structure of the whole gene or its function. Although sequence similarity search has been proven useful in many cases, it has been shown that only a fraction of newly discovered sequences have identifiable homologs in the current databases (Oliver et al. 1992; Wilson et al. 1994; Dunham et al. 1999). Furthermore, Green et al. (1993) suggested that currently known proteins may already include representatives of most ancient conserved regions (ACRs, regions of protein sequences showing highly significant similarity across phyla) and that new sequences not similar to any database sequence are unlikely to contain ACRs. The proportion of vertebrate genes with no detectable similarity in other phyla is estimated to be 50% (Claverie 1997). This is supported by a recent analysis of human chromosome 22 (Dunham et al. 1999) where only 50% of the proteins are found to be similar to previously known proteins. These results suggest that, even today, only one half of all new vertebrate genes may be discovered by sequence similarity search across phyla. Considering that a complete vertebrate genome is still not available and that the most prominent vertebrate organisms in GenBank (Benson et al. 2000), Homo sapiens, and Mus musculus, have only 25% and 0.6% of their genomes present in finished sequences, respectively (data from September 2000), it is obvious that sequence similarity search within vertebrates is currently limited. When more vertebrate sequences become available in Gen- Bank (such as mouse, zebrafish, or pufferfish), matches within phyla will be more likely and this will facilitate the detection of genes coding for non-acr-containing proteins. The second computational approach for the prediction of genes structures in the genomic DNA sequences, termed the template approach, integrates coding statistics with signal detection into one framework. Coding statistics behave differently on coding and noncoding regions and they are measures indicative of protein coding functions. A number of these measures have been evaluated by Fickett and Tung (1992), and it has been concluded that the in-phase hexamer measure, which measures the frequency of occurrence of oligonucleotides of length six in a specific reading frame, is the most effective. Indeed, this measure was used successfully in many recently developed programs such as GeneMark.hmm (Lukashin and Borodovski 1998), Genscan (Burge 1997), and HMMgene (Krogh 1997). This coding statistic is usually implemented as a 5th-order hidden Markov model (HMM) (for a review of the theory of HMMs, see Rabiner 1989). Signal sensors attempt to mimic closely processes occurring within the cell. They are intended to identify sequence signals, usually just several nucleotides-long subsequences, which are recognized by cell machinery and are initiators of certain processes. The signals that are usually modeled by gene-finding programs are promoter elements, start and stop codons, splice sites, and poly-a sites. Many different pattern recognition methods have been used as signal detectors, including simple consensus sequences, weight matrices, weight arrays, neural networks, and decision trees. DNA sequence signals have low information content; they are usually degenerate and highly unspecific because it is almost impossible to distinguish the signals truly processed by the cell from those that are apparently nonfunctional. Therefore, signal sensors are not sufficient to elucidate gene structure, and it is necessary to combine them with coding statistics methods in order to obtain satisfactory predictive power. Both codon statistics and signal models are learned from a training set: Frequencies of oligonucleotide occurrence in different regions of the genes are calculated from sequences in the training set, and the signal models are constructed using the alignment of the signal sequences from the training set. 818 Genome Research

3 Gene-Finding Programs on Mammalian Sequences There is also a group of programs that integrate a third component in their systems: similarity with an annotated sequence. Examples of such programs are GeneID+ (Guigo et al. 1992), GeneParser3 (Snyder and Stormo 1995), Procrustes (Gelfand et al. 1996), and AAT (Huang et al. 1997). Existing gene-finding programs have been designed to identify simpler gene structure and then the intricate structure described above. Most of the programs, especially the older ones, are trained to identify just one gene in a sequence, rarely predicting any promoter elements. Some progress has been made with recently developed programs which are capable of identifying more complex genomic structure: any number of genes with either complete or partial structure. This is the case with Genie (Kulp et al. 1996), GeneMark.hmm, Genscan, and HMMgene. Still, regulatory regions and poly-a sites usually remain unidentified, 5 and 3 untranslated regions are not specified, alternative splice variants are not considered, and overlapping or nested genes are not detected. Nevertheless, the prediction of the coding sequence of typical genes is an important first step in deciphering the content of any genome, and the programs for gene-finding are used extensively for this task with considerable success. An excellent Web resource for computational gene recognition, including the URLs of gene-finding programs and a comprehensive bibliography on gene recognition and related subjects, is maintained by Wentian Li at Evaluation of Gene-Finders Because in many cases there is no additional evidence to support the gene predictions provided by ab initio gene-finding programs, it is very important to know the accuracy level of these programs. The reliability of the programs concerns both users and developers. Lab bench experiments are often based on the gene/exon predictions, and they usually require a substantial investment in time and resources. This is why it is important for a user to know how well a certain algorithm performs, what its strengths and weaknesses are, and how to interpret a particular score given by the program. For developers it is valuable to know the current state of the art, to relate the program s efficiency and reliability to the methods used, and to recognize the weaknesses that need to be addressed. To evaluate gene-finding programs meaningfully it is necessary to do it uniformly on one test set of sequences. It is also important to avoid using sequences used for the training of programs analyzed, otherwise the accuracy of the programs may be overestimated. Previous comparative analyses of gene-finding programs have been performed by Snyder and Stormo in 1995 and Burset and Guigo in Snyder and Stormo (1995) analyzed three gene-finding programs GeneID, GRAIL (Xu et al. 1994) (two versions), and GeneParser (three versions of the program) on rather limited test sets containing 28 and 34 sequences. A more comprehensive evaluation of gene structure prediction programs was done by Burset and Guigo (1996). These authors tested nine programs on a test set of 570 sequences and introduced a number of performance metrics to measure the accuracy of prediction on the nucleotide, exon, and gene levels. Some of these measures were known and had been used previously (sensitivity, specificity, and correlation coefficient at the nucleotide level) and some were newly introduced (approximate correlation, sensitivity, and specificity at the exon level). Those authors also investigated the behavior of the programs on sequences with errors (frameshift mutations), sequences with differing G + C content and sequences from different phylogenic groups within the vertebrates. This comprehensive analysis has been a valuable resource for both users and developers of gene prediction programs, and the Burset/Guigo dataset has been used extensively as a benchmark dataset for testing new generations of programs. In the four years since the Burset/Guigo analysis was published, many new programs have been developed. The accuracy measures for most of them have been reported for the Burset/Guigo dataset. A reason for concern is not just that authors tested their programs themselves, but also in many cases it is not clear how the sequences from a program s training set overlap with the Burset/Guigo test set. It is realistic to assume that, in many cases, the training sets of these programs do overlap with the Burset/Guigo dataset because at one time it contained the vast majority of available vertebrate genomic sequences. Here we present a new independent comparative analysis of seven recently developed programs that use only coding and signal information for the prediction of gene structure: FGENES (V. Solovyev, unpubl.), GeneMark.hmm, Genie, Genscan, HMMgene, Morgan (Salzberg et al. 1998), and MZEF (Zhang 1997). The programs were tested on genomic sequences containing one single- or multi-exon gene of human, mouse or rat origin. In order to avoid overlap with the training sets of the programs, we selected only sequences entered in GenBank after the programs were developed and trained. The initial dataset was further filtered to exclude any anomalous sequences (e.g., containing atypical start codon or splice site dinucleotide, inframe stop codon), and then we subjected it to nonredundancy testing to eliminate groups or pairs of very similar sequences. Finally, we kept only sequences for which we could confirm the annotated exon/intron boundaries by aligning them with a corresponding Genome Research 819

4 Rogic et al. mrna sequence. The final set consisted of 195 sequences as described below. For all the programs tested, the basic accuracy measures introduced by Burset and Guigo (1996) were calculated. We chose to compute only nucleotide and exon level accuracy measures, because the prediction of the entire gene structure is still unreliable and seldom used. As an illustration, Dunham et al. (1999) identified 94% at least partially predicted genes on human chromosome 22 using Genscan, but only 20% of genes had all exons predicted exactly. We also examined the programs accuracy as a function of various sequence or prediction features, such as G + C content of the sequence, length and type of the annotated and predicted exons, signal type of the annotated and predicted exons, and the score/probability of the predicted exon, motivated by a similar detailed analysis done by Burge (1997). RESULTS The accuracy measures for five of the seven programs analyzed in this study have already been reported for the Burset/Guigo dataset. However, the results should be considered with some care because it is not clear how this test dataset and the training sets used overlapped. This problem motivated us to build a dataset that does not contain any of the sequences that were part of any training set of these programs. This was accomplished simply by choosing sequences that were entered into GenBank after these programs were trained, but there are no guarantees that there are no similarities between these newly selected sequences and sequences in the programs training datasets. Our opinion is that an independent dataset should not necessarily exclude the sequences that are similar to the sequences in the training set, because, realistically, the unannotated sequence submitted to the gene-finding program might also be similar to an already known and characterized sequence that has been use in the program s training set. The same approach was advocated by Snyder and Stormo (1995). Sequences from the test dataset HMR195 were run through the seven gene prediction programs. For each sequence, the exons predicted on the forward strand (predictions for the reverse strand were ignored) were compared to the actual coding exons, as annotated in the GenBank CDS feature. Although all of the programs tested, except Morgan, can predict genes and exons on both DNA strands simultaneously, the Gen- Bank records for most of the sequences in HMR195 contain only annotation for the Watson/plus strand, and consequently only prediction for that strand could be confirmed. From this comparison, accuracy measures at the nucleotide and exon levels were computed and then averaged. We chose to use averaging by sequence, where measures are first calculated for each gene and then averaged over all genes, as opposed to averaging by base, where measures are summed for all sequences and then averaged by nucleotide or by exon, depending on the measure type. The former is thought to give a better indication of the success rate for the individual sequence entry. For a discussion of this topic, see Dong and Searls (1995) and Burset and Guigo (1996). The measures are averaged only over sequences for which they are defined. This might overestimate the values for sensitivity (Sn), specificity (Sp), approximate correlation (AC), correlation coefficient (CC), exon level sensitivity (ESn), exon level specificity (ESp), and the average of the latter two on the sets that have many sequences without prediction. So, in order to obtain a realistic estimate of the gene-finders performance, one must also look at the number of sequences where no genes were predicted. The accuracy measures for all of the programs analyzed, averaged for the entire HMR195 dataset, are presented in Table 1. The next step was to examine accuracy as a function of various sequence or prediction features, such as G + C content of the sequence, length and type of the annotated exons and predicted exons, signal type for both annotated and predicted genes, and the score/ probability of the exon prediction. For each of these characteristics, the dataset was divided into the subsets exhibiting different value ranges or types of the characteristic examined. For each subset, the accuracy measures were calculated and averaged over all of the sequences belonging to it. These results are presented in the tables below; the seventh table shows the accuracy results for human and murine sequences separately. Because of space limitations we show only some of the accuracy measures calculated. The full results for each program can be found at rogic/evaluation/results.html. DISCUSSION Comparing the results presented by Burset and Guigo (1996) with the results obtained in the present study (Table 1), it is apparent that the new generation of programs has, overall, substantially higher prediction accuracy than the programs analyzed by Burset and Guigo in At that time, the program with the best approximate correlation (among the programs not using any database similarity search) was FGENEH, with AC = 0.78, while the highest AC in 1999 is 0.91, exhibited by both Genscan and HMMgene. On the exon level, (ESn + ESp)/2 has increased from 0.64 for FGENEH to 0.76 for HMMgene. Also, earlier gene-finders were programmed to have low false-positive rates at the expense of losing valid predictions, which resulted in, on average, 20% higher specificity than sensitivity. Programs of the new generation are tuned to have equally 820 Genome Research

5 Gene-Finding Programs on Mammalian Sequences Table 1. Nucleotide and Exon Level Accuracy Programs No. of sequences Nucleotide accuracy Exon accuracy Sn Sp AC CC ESn ESp (ESn+ESp)/2 ME WE PCa PCp OL FGENES 195 (5) GeneMark.hmm 195 (0) Genie 195 (15) Genscan 195 (3) HMMgene 195 (5) Morgan 127 (0) MZEF 119 (8) For each sequence in the HMR195 dataset, the exons predicted on the forward (+) strand were compared to the annotated exons. The standard measures of predictive accuracy on nucleotide and exon level were calculated for each sequence and averaged over all sequences for which they were defined. This was done separately for each of the programs tested. (No. of sequences) number of sequences effectively analyzed by each program; in parentheses is the number of sequences where the absence of gene was predicted; (Sn) nucleotide level sensitivity; (Sp) nucleotide level specificity; (AC) approximate correlation; (CC) correlation coefficient; (ESn) exon level sensitivity; (ESp) exon level specificity; (ME) missed exons; (WE) wrong exons; (PCa) proportion of real exons that were partially predicted (only one exon boundary correct); (PCp) proportion of predicted exons that were only partially correct; (OL) proportion of predicted exons that overlap an actual exon. AC and (ESn+ESp)/2 are given with standard deviation. high sensitivity and specificity, which is more desirable. These improvements have come about as a result of the development of more accurate models for gene structure that are capable of recognizing many different gene features in the sequence. Most of the genefinders analyzed use explicit duration HMM with associated length distribution for each state. These models of genomic structure are hierarchical, with generalized HMM modeling the overall gene structure, where states of the model are independent probabilistic models themselves, such as HMMs and neural networks. Also, new methods have been developed for signal recognition, such as maximal dependence decomposition used for the recognition of donor site in Genscan, neural networks in Genie, and ribosomal binding sites in GeneMark.hmm. The training sets are carefully selected and the average number of training sequences in the dataset has increased, allowing for more diversity in genomic content of the training sequences. With the accuracy measures at the nucleotide level as high as 0.91 for the AC value for both Genscan and HMMgene, we might conclude that the problem of computational gene-finding is almost solved. However, looking at the results for exon sensitivity and specificity and their average, we see that the goal is still far away. Why is there such a gap between AC and (ESn + ESp)/2? Since ESn and ESp are defined as true exons (TE) divided by annotated exons (AE) and predicted exons (PE), respectively, this means that in calculating these two measures only exons with both boundaries predicted correctly will be considered. An almost perfectly predicted exon, covering the whole sequence of an actual exon but exceeding the splicing site by just a few nucleotides will not be counted in TE. In order to predict the exact boundaries of an exon, a program has to have a strong search by signal component signal sensors for identifying start and stop codons and splicing sites. However, signal detection, especially of start and stop codons, is probably the weakest component of current gene-finding programs, as can be observed in Table 2. Although discrimination among coding and noncoding regions, most often done by measuring the hexamer frequencies in these Table 2. Programs Accuracy versus Signal Type start codon (195) Signal type acceptor site (753) donor site (753) stop codon (195) FGENES (0.63) (0.77) (0.82) (0.72) GeneMark.hmm (0.60) (0.75) (0.78) (0.64) Genie (0.57) (0.82) (0.83) (0.73) Genscan (0.78) (0.80) (0.84) (0.86) HMMgene (0.78) (0.85) (0.87) (0.81) Morgan (0.43) (0.57) (0.56) (0.39) MZEF (0.65) (0.73) For each program, the proportion of actual signals identified correctly (the upper number) and the proportion of predicted signals that are correct (the lower number) are averaged over all signals belonging to a particular type. The number in parenthesis in the header of each column represents the number of signals of each type in the HMR195 dataset. Genome Research 821

6 Rogic et al. regions, has shown to be quite successful, signal recognition could still be improved. There has been significant effort to improve the prediction of acceptor and donor splice, and many different methods have been used for this task, such as neural networks and maximal dependence decomposition (the methods for splice site detection are not known to us for all of the programs analyzed). The success of these methods is apparent in Table 2. On the other hand, we are not aware of any systematic effort to tackle the problem of start and stop codon detection. These signals are considered to have low information content, and they are usually detected by using weight (positional) matrices, weight arrays that capture dependencies between adjacent nucleotides or, in the case of Genie, neural networks for translation initiation site. The tendency to miss actual signals can also be observed from the proportion of partially predicted exons (PCa) that ranges from 0.08 for Genie to 0.29 for GeneMark.hmm (Table 1). GeneMark.hmm, and Morgan, besides having high proportions of PCa, also have a relatively high proportion of OL, the proportion of predicted exons that overlap an actual exon (0.09 and 0.07, respectively, Table 1), and according to these results they are the programs with the poorest signal detection. If we add the number of the partially predicted exons to the number of correctly predicted exons and use this number for calculating ESn and ESp, then AC and (ESn + ESp)/2 would have similar values. G + C Content The human genome (and the genomes of other warmblooded vertebrates) is not a structurally homogenous sequence of nucleotides. Instead, it is a mosaic of isochores, long (>300 kb, on average) DNA regions whose base composition is locally homogenous, but varies significantly between disjoint regions. The genome is usually divided into five different compositional categories: L1, L2 (A + T-rich regions), H1, H2, and H3 (G + C-rich regions) in increasing order of G + C percentage. It has been observed (Bernardi 1993) that L1 + L2 constitute 60% of the human genome, H1 + H2 30%, and H3 only 5%. These compositional regions vary widely in gene density: Zoubak et al. (1996) calculated that L1 + L2 regions have a relative gene concentration of 4%, H1 + H2, 20%, and H3 76%. This means that the gene density in very G + C-rich DNA segments is almost 20 times higher than in A + T-rich regions. Important structural properties of genes are found to be strongly correlated with G + C content (Duret et al. 1995): genes from G + C-poor isochores code for proteins that are on average longer then those from G + C-rich isochores, intronic DNA is on average three times longer in L1 + L2 than in H3, and the number of introns per gene is higher in L1 + L2 than in H3. How does compositional variability in genomic sequences affect the performance of gene-finding programs? Burset and Guigo (1996), Snyder and Stormo (1995), and Lopez et al. (1994) have shown in their analyses that gene-finding programs usually perform worse when the G + C content is low. The proposed reasons for this anomaly are that G + C-rich genes have stronger codon bias that makes them easier to identify and that they are more frequent than the genes in A + T-rich isochores. Guigo and Fickett (1995) showed that coding statistics used by gene-finding programs (codon, dicodon, and hexamer frequency) are strongly dependent on G + C content. It is obvious that if a program has only one set of parameters intended to model gene structure (oligonucleotide frequency, length of coding and intergenic region, exon and intron length and number), it will not be able to perform equally well in both A + T- and G + C-rich sequences due to the significant structural differences between genes in these sequences. The reason why programs perform better for G + C-rich sequences could also be because they are trained on the sequence subset of GenBank, which is biased towards G + C-rich sequences. According to Duret et al. (1995), genes from G + C-rich isochores are much more frequently sequenced than those from G + C-poor isochores. Some programs, such as Genscan, HMMgene, and MZEF tested in this survey, recently adopted the approach of using distinct, empirically derived model parameters for distinct G + C compositional regions. Table 3 presents the programs accuracy measures on the sequences with different G + C content. We partitioned our dataset into four groups according to the G + C content of the sequences. These groups are closely related to previously defined isochores except that the very G + C-rich isochore was split into two groups because it was heavily populated. Seven percent (14/195) of the sequences came from L1 + L2 isochores (more precisely with G + C% 40%), 35% (69/195) of sequences from H1 + H2 (40% <G + C% 50%) and 57% (112/195) of sequences from H3 (G + C% >50%), which were subsequently split into two groups (50% <G+C% 60% and G + C% >60%), containing 93 and 19 sequences, respectively. These percentages are significantly inconsistent with the results from Bernardi (1993), which points out the huge bias for the G + C- rich sequences in GenBank. Consistent with the observations made in Burset and Guigo (1996), it seems that some programs are sensitive to the G + C content of a sequence, performing better when the sequence is G + C-rich. The programs which exhibited this in our analysis are FGENES on the nucleotide level, GeneMark.hmm and Genie on both levels, and HMMgene marginally on the exon level. Among programs that are known to use different parameter sets for different G + C content, Genscan s 822 Genome Research

7 Gene-Finding Programs on Mammalian Sequences Table 3. Accuracy versus G + C Content <40%(14) 40 50%(69) 50 60%(93) >60%(19) C + G content AC (Esn+Esp)/2 AC (Esn+Esp)/2 AC (Esn+Esp)/2 AC (Esn+Esp)/2 FGENES GeneMark.hmm Genie Genscan HMMgene Morgan MZEF The HMR195 dataset was partitioned according to the G + C% content of the sequences. The number in parenthesis in the header of each column represents the number of sequences belonging to each partition. For each program, AC and (ESn+ESp)/2 are averaged over all sequences belonging to the particular partition for which they are defined. and HMMgene s prediction accuracy is relatively independent of the base composition, but MZEF still has very variable results, especially on the exon level, that are not proportional to the G + C content of a sequence. The situation is similar for Morgan. There is one peculiarity noted in Table 3: All of the programs except Morgan have the lowest accuracy measures averaged on the sequences with G + C content between 40% and 50%. Because this is not the region with the lowest G + C composition, it is not clear whether the programs really do perform most poorly for this type of sequences or whether there is some characteristic of our test set that causes this slight drop in prediction accuracy. Exon Length The length distributions of different gene elements differ considerably from each other. Introns seem to have an approximate geometric length distribution (Hawkins 1988; Burge 1997), which is a characteristic of a discrete stochastic process with the memoryless property (Karlin and Taylor 1975). This supports the idea that introns do not have any significant constraints on their length, except that the minimal number of nucleotides (70 80) is required (Wieringa et al. 1984). Exons, in contrast, have significant functional constraints. The exon length plays an important role in proper splicing and inclusion in the mature mrna (Dominski and Kole 1991). These constraints have shaped the exon length distribution quite differently from geometric distribution. The length distribution depends on the exon type. Internal exons have length distribution close to a Gaussian distribution with a broad peak between 100 and 170 bp (Hawkins 1988). Hawkins calculated the mean internal exon length to be 137 bp (in our dataset 136 bp), and he observed very few exons shorter than 50 or longer than 300 nucleotides. Length distributions for initial and terminal coding exons are not recognizable statistical distributions. They are still substantially peaked around 60 and 160 bp, respectively (Hawkins 1988; Burge 1997), but do not have a steep drop-off in density after 300 bp. Both types of exons are more variable in length than internal exons, and their calculated means are 134 bp for initial exons and 198 bp for terminal exons (Hawkins 1988) (in our dataset 207 bp for initial and 265 bp for terminal). For the fourth class of exons, single-exons, the length distribution is not known, but in general they are much longer than any other type of exons and their mean length is calculated to be 1300 bp (Hawkins 1988) (in our dataset 1010 bp). In our analysis, we grouped exons by both their annotated length and their predicted length and averaged the accuracy measures in each group. Because many of the programs tested in this analysis (Genie, GeneMark.hmm, Genscan, and HMMgene) use explicit duration HMM, which has associated length distribution to each state of the model, it is interesting to see how these distributions influence the accuracy of their exon prediction. From Table 4 it can be observed that the general trend of all seven of the programs is to have a very low proportion of correctly predicted short exons, which then slowly but monotonically rises with the length of annotated exons. For almost all of the programs, exons are most accurately predicted when their length ranges between 75 and 200 nucleotides (these exons were the most common: 560 of 839). The exons longer than 200 nucleotides (our dataset contained 131 of these exons) seem more difficult to predict correctly, and the accuracy measures drop further as the length increases. The exception is HMMgene, which predicts longer exons with the same accuracy as the more common mediumlength exons. The exons shorter then 25 bases (there were only 17) are missed in 41% of cases for FGENES, up to 88% for MZEF. The most plausible explanation for this Genome Research 823

8 Rogic et al. Table 4. Accuracy versus Exon Length Length range of exons in bp Programs 0 24 (22) (49) (91) (130) (440) (91) 300+ (125) FGENES (0.33) (0.42) (0.64) (0.75) (0.81) (0.61) (0.66) GeneMark.hmm (0.12) (0.51) (0.58) (0.72) (0.73) (0.62) (0.45) Genie (0.18) (0.47) (0.66) (0.81) (0.83) (0.68) (0.69) Genscan (0.29) (0.81) (0.79) (0.85) (0.76) (0.71) (0.65) HMMgene (0.42) (0.76) (0.75) (0.77) (0.85) (0.72) (0.74) Morgan (0.14) (0.14) (0.31) (0.57) (0.57) (0.41) (0.35) MZEF (0.00) (0.44) (0.45) (0.58) (0.73) (0.53) (0.26) The HMR195 dataset was partitioned according to the length of the annotated exons. The number in parenthesis in the header of each column represents the number of actual exons belonging to each partition. For each program, CRa (the proportion of real exons that are correctly predicted [the upper number]) and CRp (the proportion of predicted exons that are correct [the number in parentheses]) are averaged over all sequences belonging to that particular partition. phenomenon is that the length of the coding region is too short to be clearly distinguished from surrounding noncoding regions. In addition, there is biochemical evidence that this type of exon is inefficiently spliced in vivo without the presence of special splicing activating sequences (Dominski and Kole 1991). Finally, the associated length distributions used by some programs do not favor very short exons, and depending how these distributions are used by the systems, this may cause poor prediction for this type of exon. Although very long exons are less likely to be predicted correctly than medium-length exons, they are most unlikely to be completely missed. The number of partially predicted exons longer than 300 nucleotides is relatively large (data not shown), and only <7% of them are completely missed (the exception is MZEF with 33% of exons missed). Lastly, it can be noted from Table 4 that there is usually a significant difference between the proportion of annotated exons that are correctly predicted (CRa) and the proportion of predicted exons that are exactly correct (CRp) for very short exons. The reason for this is that whereas programs FGENES, Genie, and Morgan overpredict short exons, the rest of the programs underpredict them the total number of short exons predicted by each of these programs is much lower than the actual number of exons of the same size. Again, this may be the consequence of exon length distribution built into the gene-finding programs. This discrepancy in the numbers of real and predicted exons is much smaller for the longer exons. Exon Type and Signal Prediction Table 5 summarizes accuracy measures for different exon types. What can be observed is a striking difference between the proportion of correctly predicted internal exons on one side and the proportion of correctly predicted initial and terminal exons on the other side. This difference is partially eroded with a high number of partially predicted initial and terminal ex- Table 5. Programs Accuracy versus Exon Type initial (152) internal (601) Exon type terminal (152) single (43) FGENES (0.55) (0.78) (0.58) (0.83) GeneMark.hmm (0.48) (0.72) (0.51) (0.65) Genie (0.45) (0.82) (0.57) (0.68) Genscan (0.71) (0.76) (0.73) (0.83) HMMgene (0.72) (0.83) (0.73) (0.79) Morgan (0.35) (0.46) (0.36) MZEF The HMR195 dataset was partitioned according to the type of the annotated exons. The number in parenthesis in the header of each column represents the number of actual exons belonging to each partition. For each program, CRa (the upper number) and CRp (the number in parentheses), are averaged over all sequences belonging to that particular partition. 824 Genome Research

9 Gene-Finding Programs on Mammalian Sequences ons (data not shown), especially if we allow the predicted exon to be of any type; but still initial and terminal exons are more likely to be completely missed than internal exons. For single-exon genes, the situation is similar in the sense that the proportion of correctly predicted single exons (CRa) is usually significantly lower than CRa for internal exons (the exceptions are HMMgene and Genie), but they have very high PCa, the proportion of partially predicted exons (the extreme case is GeneMark.hmm with CRa = 0.30 and PCa = 0.56 for single exons). The difference is that single-exon genes are very rarely missed, and the proportion of missed exons of this type is the lowest among all exon types and all programs. The only program that is almost equally successful in predicting exons of any type is HMMgene, which also has the highest proportion of correctly predicted exons (CRp) for initial, terminal and single exons among all of the programs. This HMMgene characteristic surely contributes to its excellent results on our dataset. Why are initial, terminal, and single exons more difficult to identify? The only obvious structural differences among different types of exons are the signals bordering them; there are no studies showing that codon usage (hexamer frequency) fluctuates among different exon types. The difference in exon length could be a reason, since internal exons (136 bp in our dataset) belong to a group of exons more likely to be identified correctly than exons longer than 200 bp, which is the case with initial and terminal exons (207 bp and 265 bp, respectively) (Table 4). However, the differences in accuracy level shown in Table 4 do not compensate for the large differences shown in Table 5. Our hypothesis that signal prediction is mainly responsible for the difference in accuracy levels is supported by the results given in Table 2: The detection of start and stop codons is much less accurate than that of acceptor and donor sites (again, the exception is HMMgene), and the difference in accuracy level is proportional to the accuracy level difference for initial and terminal exons versus internal exons shown in Table 5. As noted above, during the assembly of HMR195 we were not able to validate the locations of annotated start and stop codons. Consequently, prediction accuracy measures calculated for these signals as well as subsequent analysis and discussion strongly rely on the correctness of their annotation in GenBank. Burge and Karlin (1998) presented similar results for Genscan signal detection on a different dataset. The situation is a bit more complex for single-exon genes: on the one hand they contain both start and terminal codons which should complicate their identification even further, but on the other their average length in our dataset is 1022 bp, which according to the analysis described above in the section entitled Exon Length makes them hard to predict exactly, but difficult to miss. This directly corresponds to the results shown in Table 5. What can also be observed in Table 5 is that for the programs FGENES, GeneMark.hmm, and Genscan there is a significant difference between CRa and CRp for the single exons. Analogous to the case of very short exons, the cause of this phenomenon is that these programs are conservative in predicting single-exon genes: the number of single exons predicted by any of these programs is much lower then the number of real ones in our dataset. The consequence of this is that single exons predicted by these programs have a very good chance of being correct, while many real such exons remain unidentified. Exon Probabilities and Scores Each of the programs evaluated in this study except GeneMark.hmm has a scoring scheme for its exon prediction. Genscan, HMMgene and MZEF have a probability score for each exon predicted that is supposed to be a quantitative measure of the likelihood that the given exon is correct. Morgan s scores were originally intended to be probabilities, but that intention was not followed through subsequent upgrades, and what is left is a scale with no formal meaning except that very high scores result from motifs that Morgan has seen Table 6. Accuracy versus Probability Probability range of predicted exons Programs Genscan (112) (159) (132) (93) (481) HMMgene (91) (173) (136) (96) (406) MZEF (111) (104) (72) (258) The HMR195 dataset was partitioned according to probability of the predicted exons. For each program, CRp (proportion of predicted exons that are correct) is averaged over all sequences belonging to that particular partition. The number in parenthesis is the number of exons belonging to each partition. Genome Research 825

10 Rogic et al. before. Genie uses the bit score against a background distribution that is dependent on the length of the predicted exon and thus the scores cannot be meaningfully compared to other exon scores. The way FGENES calculates exon scores is not known to us. Since the nature of Morgan s and Genie s scores makes them uninformative for a user, we tested the reliability of exon score for the other four programs: FGENES, Genscan, HMMgene, and MZEF. The results for FGENES (data not shown) appear to show that its scores are not directly useful. Most of the exons predicted have a score <10, and CR values average to the similar levels for any subregion on the scale from 0 to10. The only informative exon score is above 10, since, at least in our dataset, these exons are correctly predicted in 90% of cases, which is significantly higher than the CR for exons with lower scores. However, scores this high are rarely assigned to an exon prediction. The accuracy measures for different regions of probability scores for Genscan, HMMgene, and MZEF are displayed in Table 6. What can be observed is that CR values are monotonically rising with the increase in exon probability. For Genscan and HMMgene, these values are usually close to the lower boundary of a probability range (the exception is the probability region , where CR values are lower than probabilities). MZEF, on the other hand, significantly overestimates probabilities for its exon predictions. If we relax our criterion and include partially predicted exons (the results for PC values are not shown), CR values will reach and sometimes overreach the upper boundary of a probability region (CR values for MZEF will correspond to the probability region average). This analysis shows that in the case of Genscan and HMMgene, the exon probability score can be a very useful guide to the reliability of the exon prediction. Phylogenetic Specificity All of the programs analyzed in this survey were trained on human sequences except for Morgan, which was trained on the dataset of vertebrate sequences collected by Burset and Guigo (1996). Since the dataset we used to test the programs was composed of 103 human and 92 murine (82 Mus musculus and 10 Rattus norvegicus) sequences, we wanted to investigate whether such a phylogenetic mix can corrupt the performance of the gene-finding programs, especially those calibrated for human sequences. The AC values on the nucleotide level and for (ESn + ESp)/2 on the exon level for each of the programs, but separately for human and murine sequences, are given in Table 7. It can be observed that the difference in accuracy measures between human and mouse/rat are marginal. Even more interesting is that in most cases the values for murine sequences are higher than the values for the human sequence, even though the model parameters of the programs were learned from the set of human sequences. It is likely that such differences are not statistically significant and that they would also be observed if the results on two different human sequence sets were compared. This hypothesis is also supported by the comparison of the human and mouse grammars constructed by Dong and Searls (1994), where no significant differences were found. Conclusions The results obtained in this analysis indicate that the new generation of programs has significantly higher accuracy then the programs analyzed by Burset and Guigo in Comparing the programs with the highest approximation correlation in their study and our study, we find that this value has improved from 0.78 (FGENEH) to 0.91 (Genscan and HMMgene), a 17% increase, while the highest averaged exon sensitivity and specificity has improved from 0.64 (FGENEH) to 0.76 (HMMgene), a 19% increase. The behavior of the programs on the sequences with different G + C content is not systematic. Some programs accuracy appears to be slightly dependent Table 7. Phylogenetic Specificity Nucleotide accuracy-ac Exon accuracy (ESn+ESp)/2 Programs Trained on human murine human murine FGENES human GeneMark.hmm human Genie human Genscan human HMMgene human Morgan vertebrate MZEF human The HMR195 dataset was split into two species subsets containing 103 human and 92 murine sequences. For each subset and each program, AC and (ESn+ESp)/2 were averaged over all sequences belonging to the particular subset. 826 Genome Research

Outline. Gene Finding Questions. Recap: Prokaryotic gene finding Eukaryotic gene finding The human gene complement Regulation

Outline. Gene Finding Questions. Recap: Prokaryotic gene finding Eukaryotic gene finding The human gene complement Regulation Tues, Nov 29: Gene Finding 1 Online FCE s: Thru Dec 12 Thurs, Dec 1: Gene Finding 2 Tues, Dec 6: PS5 due Project presentations 1 (see course web site for schedule) Thurs, Dec 8 Final papers due Project

More information

Gene Structure & Gene Finding Part II

Gene Structure & Gene Finding Part II Gene Structure & Gene Finding Part II David Wishart david.wishart@ualberta.ca 30,000 metabolite Gene Finding in Eukaryotes Eukaryotes Complex gene structure Large genomes (0.1 to 10 billion bp) Exons and

More information

UCSC Genome Browser. Introduction to ab initio and evidence-based gene finding

UCSC Genome Browser. Introduction to ab initio and evidence-based gene finding UCSC Genome Browser Introduction to ab initio and evidence-based gene finding Wilson Leung 06/2006 Outline Introduction to annotation ab initio gene finding Basics of the UCSC Browser Evidence-based gene

More information

Outline. Introduction to ab initio and evidence-based gene finding. Prokaryotic gene predictions

Outline. Introduction to ab initio and evidence-based gene finding. Prokaryotic gene predictions Outline Introduction to ab initio and evidence-based gene finding Overview of computational gene predictions Different types of eukaryotic gene predictors Common types of gene prediction errors Wilson

More information

Model. David Kulp, David Haussler. Baskin Center for Computer Engineering and Computer Science.

Model. David Kulp, David Haussler. Baskin Center for Computer Engineering and Computer Science. Integrating Database Homology in a Probabilistic Gene Structure Model David Kulp, David Haussler Baskin Center for Computer Engineering and Computer Science University of California, Santa Cruz CA, 95064,

More information

Genie Gene Finding in Drosophila melanogaster

Genie Gene Finding in Drosophila melanogaster Methods Gene Finding in Drosophila melanogaster Martin G. Reese, 1,2,4 David Kulp, 2 Hari Tammana, 2 and David Haussler 2,3 1 Berkeley Drosophila Genome Project, Department of Molecular and Cell Biology,

More information

Eukaryotic Gene Prediction. Wei Zhu May 2007

Eukaryotic Gene Prediction. Wei Zhu May 2007 Eukaryotic Gene Prediction Wei Zhu May 2007 In nature, nothing is perfect... - Alice Walker Gene Structure What is Gene Prediction? Gene prediction is the problem of parsing a sequence into nonoverlapping

More information

Methods and Algorithms for Gene Prediction

Methods and Algorithms for Gene Prediction Methods and Algorithms for Gene Prediction Chaochun Wei 韦朝春 Sc.D. ccwei@sjtu.edu.cn http://cbb.sjtu.edu.cn/~ccwei Shanghai Jiao Tong University Shanghai Center for Bioinformation Technology 5/12/2011 K-J-C

More information

TIGR THE INSTITUTE FOR GENOMIC RESEARCH

TIGR THE INSTITUTE FOR GENOMIC RESEARCH Introduction to Genome Annotation: Overview of What You Will Learn This Week C. Robin Buell May 21, 2007 Types of Annotation Structural Annotation: Defining genes, boundaries, sequence motifs e.g. ORF,

More information

Gene Signal Estimates from Exon Arrays

Gene Signal Estimates from Exon Arrays Gene Signal Estimates from Exon Arrays I. Introduction: With exon arrays like the GeneChip Human Exon 1.0 ST Array, researchers can examine the transcriptional profile of an entire gene (Figure 1). Being

More information

Annotating 7G24-63 Justin Richner May 4, Figure 1: Map of my sequence

Annotating 7G24-63 Justin Richner May 4, Figure 1: Map of my sequence Annotating 7G24-63 Justin Richner May 4, 2005 Zfh2 exons Thd1 exons Pur-alpha exons 0 40 kb 8 = 1 kb = LINE, Penelope = DNA/Transib, Transib1 = DINE = Novel Repeat = LTR/PAO, Diver2 I = LTR/Gypsy, Invader

More information

Genomics and Gene Recognition Genes and Blue Genes

Genomics and Gene Recognition Genes and Blue Genes Genomics and Gene Recognition Genes and Blue Genes November 3, 2004 Eukaryotic Gene Structure eukaryotic genomes are considerably more complex than those of prokaryotes eukaryotic cells have organelles

More information

Fig Ch 17: From Gene to Protein

Fig Ch 17: From Gene to Protein Fig. 17-1 Ch 17: From Gene to Protein Basic Principles of Transcription and Translation RNA is the intermediate between genes and the proteins for which they code Transcription is the synthesis of RNA

More information

Using Expressing Sequence Tags to Improve Gene Structure Annotation

Using Expressing Sequence Tags to Improve Gene Structure Annotation Washington University in St. Louis Washington University Open Scholarship All Computer Science and Engineering Research Computer Science and Engineering Report Number: WUCS-2006-25 2006-05-01 Using Expressing

More information

Eukaryotic Gene Structure

Eukaryotic Gene Structure Eukaryotic Gene Structure Terminology Genome entire genetic material of an individual Transcriptome set of transcribed sequences Proteome set of proteins encoded by the genome 2 Gene Basic physical and

More information

Transcription in Eukaryotes

Transcription in Eukaryotes Transcription in Eukaryotes Biology I Hayder A Giha Transcription Transcription is a DNA-directed synthesis of RNA, which is the first step in gene expression. Gene expression, is transformation of the

More information

MODULE 1: INTRODUCTION TO THE GENOME BROWSER: WHAT IS A GENE?

MODULE 1: INTRODUCTION TO THE GENOME BROWSER: WHAT IS A GENE? MODULE 1: INTRODUCTION TO THE GENOME BROWSER: WHAT IS A GENE? Lesson Plan: Title Introduction to the Genome Browser: what is a gene? JOYCE STAMM Objectives Demonstrate basic skills in using the UCSC Genome

More information

Chapter 14 Active Reading Guide From Gene to Protein

Chapter 14 Active Reading Guide From Gene to Protein Name: AP Biology Mr. Croft Chapter 14 Active Reading Guide From Gene to Protein This is going to be a very long journey, but it is crucial to your understanding of biology. Work on this chapter a single

More information

Integrating Genomic Homology into Gene Structure Prediction

Integrating Genomic Homology into Gene Structure Prediction BIOINFORMATICS Vol. 1 Suppl. 1 2001 Pages S1 S9 Integrating Genomic Homology into Gene Structure Prediction Ian Korf 1, Paul Flicek 2, Daniel Duan 1 and Michael R. Brent 1 1 Department of Computer Science,

More information

Year III Pharm.D Dr. V. Chitra

Year III Pharm.D Dr. V. Chitra Year III Pharm.D Dr. V. Chitra 1 Genome entire genetic material of an individual Transcriptome set of transcribed sequences Proteome set of proteins encoded by the genome 2 Only one strand of DNA serves

More information

Chimp Chunk 3-14 Annotation by Matthew Kwong, Ruth Howe, and Hao Yang

Chimp Chunk 3-14 Annotation by Matthew Kwong, Ruth Howe, and Hao Yang Chimp Chunk 3-14 Annotation by Matthew Kwong, Ruth Howe, and Hao Yang Ruth Howe Bio 434W April 1, 2010 INTRODUCTION De novo annotation is the process by which a finished genomic sequence is searched for

More information

The Genetic Code and Transcription. Chapter 12 Honors Genetics Ms. Susan Chabot

The Genetic Code and Transcription. Chapter 12 Honors Genetics Ms. Susan Chabot The Genetic Code and Transcription Chapter 12 Honors Genetics Ms. Susan Chabot TRANSCRIPTION Copy SAME language DNA to RNA Nucleic Acid to Nucleic Acid TRANSLATION Copy DIFFERENT language RNA to Amino

More information

CHAPTER 21 LECTURE SLIDES

CHAPTER 21 LECTURE SLIDES CHAPTER 21 LECTURE SLIDES Prepared by Brenda Leady University of Toledo To run the animations you must be in Slideshow View. Use the buttons on the animation to play, pause, and turn audio/text on or off.

More information

Analysis of Biological Sequences SPH

Analysis of Biological Sequences SPH Analysis of Biological Sequences SPH 140.638 swheelan@jhmi.edu nuts and bolts meet Tuesdays & Thursdays, 3:30-4:50 no exam; grade derived from 3-4 homework assignments plus a final project (open book,

More information

Biotechnology Explorer

Biotechnology Explorer Biotechnology Explorer C. elegans Behavior Kit Bioinformatics Supplement explorer.bio-rad.com Catalog #166-5120EDU This kit contains temperature-sensitive reagents. Open immediately and see individual

More information

Transcription Eukaryotic Cells

Transcription Eukaryotic Cells Transcription Eukaryotic Cells Packet #20 1 Introduction Transcription is the process in which genetic information, stored in a strand of DNA (gene), is copied into a strand of RNA. Protein-encoding genes

More information

Chapter 14: Gene Expression: From Gene to Protein

Chapter 14: Gene Expression: From Gene to Protein Chapter 14: Gene Expression: From Gene to Protein This is going to be a very long journey, but it is crucial to your understanding of biology. Work on this chapter a single concept at a time, and expect

More information

DNA Transcription. Dr Aliwaini

DNA Transcription. Dr Aliwaini DNA Transcription 1 DNA Transcription-Introduction The synthesis of an RNA molecule from DNA is called Transcription. All eukaryotic cells have five major classes of RNA: ribosomal RNA (rrna), messenger

More information

8/21/2014. From Gene to Protein

8/21/2014. From Gene to Protein From Gene to Protein Chapter 17 Objectives Describe the contributions made by Garrod, Beadle, and Tatum to our understanding of the relationship between genes and enzymes Briefly explain how information

More information

SSA Signal Search Analysis II

SSA Signal Search Analysis II SSA Signal Search Analysis II SSA other applications - translation In contrast to translation initiation in bacteria, translation initiation in eukaryotes is not guided by a Shine-Dalgarno like motif.

More information

Chimp Sequence Annotation: Region 2_3

Chimp Sequence Annotation: Region 2_3 Chimp Sequence Annotation: Region 2_3 Jeff Howenstein March 30, 2007 BIO434W Genomics 1 Introduction We received region 2_3 of the ChimpChunk sequence, and the first step we performed was to run RepeatMasker

More information

Baum-Welch and HMM applications. November 16, 2017

Baum-Welch and HMM applications. November 16, 2017 Baum-Welch and HMM applications November 16, 2017 Markov chains 3 states of weather: sunny, cloudy, rainy Observed once a day at the same time All transitions are possible, with some probability Each state

More information

Genome Annotation. What Does Annotation Describe??? Genome duplications Genes Mobile genetic elements Small repeats Genetic diversity

Genome Annotation. What Does Annotation Describe??? Genome duplications Genes Mobile genetic elements Small repeats Genetic diversity Genome Annotation Genome Sequencing Costliest aspect of sequencing the genome o But Devoid of content Genome must be annotated o Annotation definition Analyzing the raw sequence of a genome and describing

More information

Chapter 17: From Gene to Protein

Chapter 17: From Gene to Protein Name Period This is going to be a very long journey, but it is crucial to your understanding of biology. Work on this chapter a single concept at a time, and expect to spend at least 6 hours to truly master

More information

DNA Transcription. Visualizing Transcription. The Transcription Process

DNA Transcription. Visualizing Transcription. The Transcription Process DNA Transcription By: Suzanne Clancy, Ph.D. 2008 Nature Education Citation: Clancy, S. (2008) DNA transcription. Nature Education 1(1) If DNA is a book, then how is it read? Learn more about the DNA transcription

More information

A Markovian Approach for the Analysis of the Gene Structure

A Markovian Approach for the Analysis of the Gene Structure A Markovian Approach for the Analysis of the Gene Structure Christelle Melo de Lima 1, Laurent Guéguen 1, Christian Gautier 1 and Didier Piau 2 1 UMR 5558 CNRS Biométrie et Biologie Evolutive, Université

More information

Genes and gene finding

Genes and gene finding Genes and gene finding Ben Langmead Department of Computer Science You are free to use these slides. If you do, please sign the guestbook (www.langmead-lab.org/teaching-materials), or email me (ben.langmead@gmail.com)

More information

Gene Expression: Transcription

Gene Expression: Transcription Gene Expression: Transcription The majority of genes are expressed as the proteins they encode. The process occurs in two steps: Transcription = DNA RNA Translation = RNA protein Taken together, they make

More information

Themes: RNA and RNA Processing. Messenger RNA (mrna) What is a gene? RNA is very versatile! RNA-RNA interactions are very important!

Themes: RNA and RNA Processing. Messenger RNA (mrna) What is a gene? RNA is very versatile! RNA-RNA interactions are very important! Themes: RNA is very versatile! RNA and RNA Processing Chapter 14 RNA-RNA interactions are very important! Prokaryotes and Eukaryotes have many important differences. Messenger RNA (mrna) Carries genetic

More information

Transcription. DNA to RNA

Transcription. DNA to RNA Transcription from DNA to RNA The Central Dogma of Molecular Biology replication DNA RNA Protein transcription translation Why call it transcription and translation? transcription is such a direct copy

More information

Review of Protein (one or more polypeptide) A polypeptide is a long chain of..

Review of Protein (one or more polypeptide) A polypeptide is a long chain of.. Gene expression Review of Protein (one or more polypeptide) A polypeptide is a long chain of.. In a protein, the sequence of amino acid determines its which determines the protein s A protein with an enzymatic

More information

Machine Learning Methods for RNA-seq-based Transcriptome Reconstruction

Machine Learning Methods for RNA-seq-based Transcriptome Reconstruction Machine Learning Methods for RNA-seq-based Transcriptome Reconstruction Gunnar Rätsch Friedrich Miescher Laboratory Max Planck Society, Tübingen, Germany NGS Bioinformatics Meeting, Paris (March 24, 2010)

More information

RNA and Protein Synthesis

RNA and Protein Synthesis Harriet Wilson, Lecture Notes Bio. Sci. 4 - Microbiology Sierra College RNA and Protein Synthesis Considerable evidence suggests that RNA molecules evolved prior to DNA molecules and proteins, and that

More information

How Computing Science Saved The Human Genome Project Athabasca Hall Sept. 9, 2013

How Computing Science Saved The Human Genome Project Athabasca Hall Sept. 9, 2013 How Computing Science Saved The Human Genome Project david.wishart@ualberta.ca 3-41 Athabasca Hall Sept. 9, 2013 The Human Genome Project HGP Announcement June 26, 2000*** J. Craig Venter Francis Collins

More information

MOLECULAR GENETICS PROTEIN SYNTHESIS. Molecular Genetics Activity #2 page 1

MOLECULAR GENETICS PROTEIN SYNTHESIS. Molecular Genetics Activity #2 page 1 AP BIOLOGY MOLECULAR GENETICS ACTIVITY #2 NAME DATE HOUR PROTEIN SYNTHESIS Molecular Genetics Activity #2 page 1 GENETIC CODE PROTEIN SYNTHESIS OVERVIEW Molecular Genetics Activity #2 page 2 PROTEIN SYNTHESIS

More information

Bio 101 Sample questions: Chapter 10

Bio 101 Sample questions: Chapter 10 Bio 101 Sample questions: Chapter 10 1. Which of the following is NOT needed for DNA replication? A. nucleotides B. ribosomes C. Enzymes (like polymerases) D. DNA E. all of the above are needed 2 The information

More information

Ch. 10 Notes DNA: Transcription and Translation

Ch. 10 Notes DNA: Transcription and Translation Ch. 10 Notes DNA: Transcription and Translation GOALS Compare the structure of RNA with that of DNA Summarize the process of transcription Relate the role of codons to the sequence of amino acids that

More information

#28 - Promoter Prediction 10/29/07

#28 - Promoter Prediction 10/29/07 BCB 444/544 Required Reading (before lecture) Lecture 28 Mon Oct 29 - Lecture 28 Promoter & Regulatory Element Prediction Chp 9 - pp 113-126 Gene Prediction - finish it Wed Oct 30 - Lecture 29 Phylogenetics

More information

Chapter 10 - Molecular Biology of the Gene

Chapter 10 - Molecular Biology of the Gene Bio 100 - Molecular Genetics 1 A. Bacterial Transformation Chapter 10 - Molecular Biology of the Gene Researchers found that they could transfer an inherited characteristic (e.g. the ability to cause pneumonia),

More information

From Gene to Protein transcription, messenger RNA (mrna) translation, RNA processing triplet code, template strand, codons,

From Gene to Protein transcription, messenger RNA (mrna) translation, RNA processing triplet code, template strand, codons, From Gene to Protein I. Transcription and translation are the two main processes linking gene to protein. A. RNA is chemically similar to DNA, except that it contains ribose as its sugar and substitutes

More information

TRANSCRIPTION COMPARISON OF DNA & RNA TRANSCRIPTION. Umm AL Qura University. Sugar Ribose Deoxyribose. Bases AUCG ATCG. Strand length Short Long

TRANSCRIPTION COMPARISON OF DNA & RNA TRANSCRIPTION. Umm AL Qura University. Sugar Ribose Deoxyribose. Bases AUCG ATCG. Strand length Short Long Umm AL Qura University TRANSCRIPTION Dr Neda Bogari TRANSCRIPTION COMPARISON OF DNA & RNA RNA DNA Sugar Ribose Deoxyribose Bases AUCG ATCG Strand length Short Long No. strands One Two Helix Single Double

More information

Feature Selection of Gene Expression Data for Cancer Classification: A Review

Feature Selection of Gene Expression Data for Cancer Classification: A Review Available online at www.sciencedirect.com ScienceDirect Procedia Computer Science 50 (2015 ) 52 57 2nd International Symposium on Big Data and Cloud Computing (ISBCC 15) Feature Selection of Gene Expression

More information

GeneMarkS-2: Raising Standards of Accuracy in Gene Recognition

GeneMarkS-2: Raising Standards of Accuracy in Gene Recognition GeneMarkS-2: Raising Standards of Accuracy in Gene Recognition Alexandre Lomsadze 1^, Shiyuyun Tang 2^, Karl Gemayel 3^ and Mark Borodovsky 1,2,3 ^ joint first authors 1 Wallace H. Coulter Department of

More information

MATH 5610, Computational Biology

MATH 5610, Computational Biology MATH 5610, Computational Biology Lecture 2 Intro to Molecular Biology (cont) Stephen Billups University of Colorado at Denver MATH 5610, Computational Biology p.1/24 Announcements Error on syllabus Class

More information

Introduction to genome biology

Introduction to genome biology Introduction to genome biology Lisa Stubbs We ve found most genes; but what about the rest of the genome? Genome size* 12 Mb 95 Mb 170 Mb 1500 Mb 2700 Mb 3200 Mb #coding genes ~7000 ~20000 ~14000 ~26000

More information

Outline. Evolution. Adaptive convergence. Common similarity problems. Chapter 7: Similarity searches on sequence databases

Outline. Evolution. Adaptive convergence. Common similarity problems. Chapter 7: Similarity searches on sequence databases Chapter 7: Similarity searches on sequence databases All science is either physics or stamp collection. Ernest Rutherford Outline Why is similarity important BLAST Protein and DNA Interpreting BLAST Individualizing

More information

Protein Synthesis. OpenStax College

Protein Synthesis. OpenStax College OpenStax-CNX module: m46032 1 Protein Synthesis OpenStax College This work is produced by OpenStax-CNX and licensed under the Creative Commons Attribution License 3.0 By the end of this section, you will

More information

Gene Prediction: Statistical Approaches

Gene Prediction: Statistical Approaches Gene Prediction: Statistical Approaches Outline Codons Discovery of Split Genes Exons and Introns Splicing Open Reading Frames Codon Usage Splicing Signals TestCode Gene Prediction: Computational Challenge

More information

Machine learning applications in genomics: practical issues & challenges. Yuzhen Ye School of Informatics and Computing, Indiana University

Machine learning applications in genomics: practical issues & challenges. Yuzhen Ye School of Informatics and Computing, Indiana University Machine learning applications in genomics: practical issues & challenges Yuzhen Ye School of Informatics and Computing, Indiana University Reference Machine learning applications in genetics and genomics

More information

Intron Definition Is Required for Excision of the Minute Virus of Mice Small Intron and Definition of the Upstream Exon

Intron Definition Is Required for Excision of the Minute Virus of Mice Small Intron and Definition of the Upstream Exon JOURNAL OF VIROLOGY, Mar. 1998, p. 1834 1843 Vol. 72, No. 3 0022-538X/98/$04.00 0 Copyright 1998, American Society for Microbiology Intron Definition Is Required for Excision of the Minute Virus of Mice

More information

Why learn sequence database searching? Searching Molecular Databases with BLAST

Why learn sequence database searching? Searching Molecular Databases with BLAST Why learn sequence database searching? Searching Molecular Databases with BLAST What have I cloned? Is this really!my gene"? Basic Local Alignment Search Tool How BLAST works Interpreting search results

More information

Chapter 13. From DNA to Protein

Chapter 13. From DNA to Protein Chapter 13 From DNA to Protein Proteins All proteins consist of polypeptide chains A linear sequence of amino acids Each chain corresponds to the nucleotide base sequenceof a gene The Path From Genes to

More information

Agenda. Web Databases for Drosophila. Gene annotation workflow. GEP Drosophila annotation projects 01/01/2018. Annotation adding labels to a sequence

Agenda. Web Databases for Drosophila. Gene annotation workflow. GEP Drosophila annotation projects 01/01/2018. Annotation adding labels to a sequence Agenda GEP annotation project overview Web Databases for Drosophila An introduction to web tools, databases and NCBI BLAST Web databases for Drosophila annotation UCSC Genome Browser NCBI / BLAST FlyBase

More information

Genomics and Gene Recognition Genes and Blue Genes

Genomics and Gene Recognition Genes and Blue Genes Genomics and Gene Recognition Genes and Blue Genes November 1, 2004 Prokaryotic Gene Structure prokaryotes are simplest free-living organisms studying prokaryotes can give us a sense what is the minimum

More information

A Machine Learning Approach to Test Data Generation: A Case Study in Evaluation of Gene Finders

A Machine Learning Approach to Test Data Generation: A Case Study in Evaluation of Gene Finders A Machine Learning Approach to Test Data Generation: A Case Study in Evaluation of Gene Finders Henning Christiansen 1 and Christina Mackeprang Dahmcke 2 1 Research group PLIS: Programming, Logic and Intelligent

More information

Chromosome-scale scaffolding of de novo genome assemblies based on chromatin interactions. Supplementary Material

Chromosome-scale scaffolding of de novo genome assemblies based on chromatin interactions. Supplementary Material Chromosome-scale scaffolding of de novo genome assemblies based on chromatin interactions Joshua N. Burton 1, Andrew Adey 1, Rupali P. Patwardhan 1, Ruolan Qiu 1, Jacob O. Kitzman 1, Jay Shendure 1 1 Department

More information

Bio11 Announcements. Ch 21: DNA Biology and Technology. DNA Functions. DNA and RNA Structure. How do DNA and RNA differ? What are genes?

Bio11 Announcements. Ch 21: DNA Biology and Technology. DNA Functions. DNA and RNA Structure. How do DNA and RNA differ? What are genes? Bio11 Announcements TODAY Genetics (review) and quiz (CP #4) Structure and function of DNA Extra credit due today Next week in lab: Case study presentations Following week: Lab Quiz 2 Ch 21: DNA Biology

More information

BLAST. compared with database sequences Sequences with many matches to high- scoring words are used for final alignments

BLAST. compared with database sequences Sequences with many matches to high- scoring words are used for final alignments BLAST 100 times faster than dynamic programming. Good for database searches. Derive a list of words of length w from query (e.g., 3 for protein, 11 for DNA) High-scoring words are compared with database

More information

Introduction to Microarray Data Analysis and Gene Networks. Alvis Brazma European Bioinformatics Institute

Introduction to Microarray Data Analysis and Gene Networks. Alvis Brazma European Bioinformatics Institute Introduction to Microarray Data Analysis and Gene Networks Alvis Brazma European Bioinformatics Institute A brief outline of this course What is gene expression, why it s important Microarrays and how

More information

Outline. Annotation of Drosophila Primer. Gene structure nomenclature. Muller element nomenclature. GEP Drosophila annotation projects 01/04/2018

Outline. Annotation of Drosophila Primer. Gene structure nomenclature. Muller element nomenclature. GEP Drosophila annotation projects 01/04/2018 Outline Overview of the GEP annotation projects Annotation of Drosophila Primer January 2018 GEP annotation workflow Practice applying the GEP annotation strategy Wilson Leung and Chris Shaffer AAACAACAATCATAAATAGAGGAAGTTTTCGGAATATACGATAAGTGAAATATCGTTCT

More information

Finishing Fosmid DMAC-27a of the Drosophila mojavensis third chromosome

Finishing Fosmid DMAC-27a of the Drosophila mojavensis third chromosome Finishing Fosmid DMAC-27a of the Drosophila mojavensis third chromosome Ruth Howe Bio 434W 27 February 2010 Abstract The fourth or dot chromosome of Drosophila species is composed primarily of highly condensed,

More information

Gene Prediction: Statistical Approaches

Gene Prediction: Statistical Approaches Gene Prediction: Statistical Approaches Outline Codons Discovery of Split Genes Exons and Introns Splicing Open Reading Frames Codon Usage Splicing Signals TestCode Gene Prediction: Computational Challenge

More information

SAMPLE LITERATURE Please refer to included weblink for correct version.

SAMPLE LITERATURE Please refer to included weblink for correct version. Edvo-Kit #340 DNA Informatics Experiment Objective: In this experiment, students will explore the popular bioninformatics tool BLAST. First they will read sequences from autoradiographs of automated gel

More information

Make the protein through the genetic dogma process.

Make the protein through the genetic dogma process. Make the protein through the genetic dogma process. Coding Strand 5 AGCAATCATGGATTGGGTACATTTGTAACTGT 3 Template Strand mrna Protein Complete the table. DNA strand DNA s strand G mrna A C U G T A T Amino

More information

Solutions to Quiz II

Solutions to Quiz II MIT Department of Biology 7.014 Introductory Biology, Spring 2005 Solutions to 7.014 Quiz II Class Average = 79 Median = 82 Grade Range % A 90-100 27 B 75-89 37 C 59 74 25 D 41 58 7 F 0 40 2 Question 1

More information

Optimization of RNAi Targets on the Human Transcriptome Ahmet Arslan Kurdoglu Computational Biosciences Program Arizona State University

Optimization of RNAi Targets on the Human Transcriptome Ahmet Arslan Kurdoglu Computational Biosciences Program Arizona State University Optimization of RNAi Targets on the Human Transcriptome Ahmet Arslan Kurdoglu Computational Biosciences Program Arizona State University my background Undergraduate Degree computer systems engineer (ASU

More information

Evolutionary Algorithms

Evolutionary Algorithms Evolutionary Algorithms Evolutionary Algorithms What is Evolutionary Algorithms (EAs)? Evolutionary algorithms are iterative and stochastic search methods that mimic the natural biological evolution and/or

More information

Lecture Four. Molecular Approaches I: Nucleic Acids

Lecture Four. Molecular Approaches I: Nucleic Acids Lecture Four. Molecular Approaches I: Nucleic Acids I. Recombinant DNA and Gene Cloning Recombinant DNA is DNA that has been created artificially. DNA from two or more sources is incorporated into a single

More information

Mapping strategies for sequence reads

Mapping strategies for sequence reads Mapping strategies for sequence reads Ernest Turro University of Cambridge 21 Oct 2013 Quantification A basic aim in genomics is working out the contents of a biological sample. 1. What distinct elements

More information

Exploring Similarities of Conserved Domains/Motifs

Exploring Similarities of Conserved Domains/Motifs Exploring Similarities of Conserved Domains/Motifs Sotiria Palioura Abstract Traditionally, proteins are represented as amino acid sequences. There are, though, other (potentially more exciting) representations;

More information

Pauling/Itano Experiment

Pauling/Itano Experiment Chapter 12 Pauling/Itano Experiment Linus Pauling and Harvey Itano knew that hemoglobin, a molecule in red blood cells, contained an electrical charge. They wanted to see if the hemoglobin in normal RBC

More information

DNA RNA PROTEIN SYNTHESIS -NOTES-

DNA RNA PROTEIN SYNTHESIS -NOTES- DNA RNA PROTEIN SYNTHESIS -NOTES- THE COMPONENTS AND STRUCTURE OF DNA DNA is made up of units called nucleotides. Nucleotides are made up of three basic components:, called deoxyribose in DNA In DNA, there

More information

Gene Structure Prediction From Many. Engineering Sciences Lab 102. Harvard University. Phone: (617)

Gene Structure Prediction From Many. Engineering Sciences Lab 102. Harvard University. Phone: (617) Gene Structure Prediction From Many Attributes Adam A. Deaton Rocco A. Servedio Engineering Sciences Lab 102 Division of Engineering and Applied Sciences Harvard University Cambridge, MA 02138 faabd,roccog@cs.harvard.edu

More information

7 Gene Isolation and Analysis of Multiple

7 Gene Isolation and Analysis of Multiple Genetic Techniques for Biological Research Corinne A. Michels Copyright q 2002 John Wiley & Sons, Ltd ISBNs: 0-471-89921-6 (Hardback); 0-470-84662-3 (Electronic) 7 Gene Isolation and Analysis of Multiple

More information

Gene prediction in novel fungal genomes using an ab initio algorithm with unsupervised training

Gene prediction in novel fungal genomes using an ab initio algorithm with unsupervised training Methods Gene prediction in novel fungal genomes using an ab initio algorithm with unsupervised training Vardges Ter-Hovhannisyan, 1,4 Alexandre Lomsadze, 2,4 Yury O. Chernoff, 1 and Mark Borodovsky 2,3,5

More information

CHAPTER 17 FROM GENE TO PROTEIN. Section C: The Synthesis of Protein

CHAPTER 17 FROM GENE TO PROTEIN. Section C: The Synthesis of Protein CHAPTER 17 FROM GENE TO PROTEIN Section C: The Synthesis of Protein 1. Translation is the RNA-directed synthesis of a polypeptide: a closer look 2. Signal peptides target some eukaryotic polypeptides to

More information

Lecture 11: Gene Prediction

Lecture 11: Gene Prediction Lecture 11: Gene Prediction Study Chapter 6.11-6.14 1 Gene: A sequence of nucleotides coding for protein Gene Prediction Problem: Determine the beginning and end positions of genes in a genome Where are

More information

Guided tour to Ensembl

Guided tour to Ensembl Guided tour to Ensembl Introduction Introduction to the Ensembl project Walk-through of the browser Variations and Functional Genomics Comparative Genomics BioMart Ensembl Genome browser http://www.ensembl.org

More information

Gene Expression Transcription

Gene Expression Transcription Why? ene Expression Transcription How is mrn synthesized and what message does it carry? DN is often referred to as a genetic blueprint. In the same way that blueprints contain the instructions for construction

More information

GENE EXPRESSION AT THE MOLECULAR LEVEL. Copyright (c) The McGraw-Hill Companies, Inc. Permission required for reproduction or display.

GENE EXPRESSION AT THE MOLECULAR LEVEL. Copyright (c) The McGraw-Hill Companies, Inc. Permission required for reproduction or display. GENE EXPRESSION AT THE MOLECULAR LEVEL Copyright (c) The McGraw-Hill Companies, Inc. Permission required for reproduction or display. 1 Gene expression Gene function at the level of traits Gene function

More information

Synthetic Biology. Sustainable Energy. Therapeutics Industrial Enzymes. Agriculture. Accelerating Discoveries, Expanding Possibilities. Design.

Synthetic Biology. Sustainable Energy. Therapeutics Industrial Enzymes. Agriculture. Accelerating Discoveries, Expanding Possibilities. Design. Synthetic Biology Accelerating Discoveries, Expanding Possibilities Sustainable Energy Therapeutics Industrial Enzymes Agriculture Design Build Generate Solutions to Advance Synthetic Biology Research

More information

Files for this Tutorial: All files needed for this tutorial are compressed into a single archive: [BLAST_Intro.tar.gz]

Files for this Tutorial: All files needed for this tutorial are compressed into a single archive: [BLAST_Intro.tar.gz] BLAST Exercise: Detecting and Interpreting Genetic Homology Adapted by W. Leung and SCR Elgin from Detecting and Interpreting Genetic Homology by Dr. J. Buhler Prequisites: None Resources: The BLAST web

More information

Neural Networks and Applications in Bioinformatics. Yuzhen Ye School of Informatics and Computing, Indiana University

Neural Networks and Applications in Bioinformatics. Yuzhen Ye School of Informatics and Computing, Indiana University Neural Networks and Applications in Bioinformatics Yuzhen Ye School of Informatics and Computing, Indiana University Contents Biological problem: promoter modeling Basics of neural networks Perceptrons

More information

Training materials.

Training materials. Training materials Ensembl training materials are protected by a CC BY license http://creativecommons.org/licenses/by/4.0/ If you wish to re-use these materials, please credit Ensembl for their creation

More information

Chromosomes. Chromosomes. Genes. Strands of DNA that contain all of the genes an organism needs to survive and reproduce

Chromosomes. Chromosomes. Genes. Strands of DNA that contain all of the genes an organism needs to survive and reproduce Chromosomes Chromosomes Strands of DNA that contain all of the genes an organism needs to survive and reproduce Genes Segments of DNA that specify how to build a protein genes may specify more than one

More information

Zhengyan Kan 1, Warren Gish 2, Eric Rouchka 1, Jarret Glasscock 2, and David States 1

Zhengyan Kan 1, Warren Gish 2, Eric Rouchka 1, Jarret Glasscock 2, and David States 1 From: ISMB-00 Proceedings. Copyright 2000, AAAI (www.aaai.org). All rights reserved. UTR Reconstruction and Analysis Using Genomically Aligned EST Sequences Zhengyan Kan 1, Warren Gish 2, Eric Rouchka

More information

Adv Biology: DNA and RNA Study Guide

Adv Biology: DNA and RNA Study Guide Adv Biology: DNA and RNA Study Guide Chapter 12 Vocabulary -Notes What experiments led up to the discovery of DNA being the hereditary material? o The discovery that DNA is the genetic code involved many

More information

DNA and RNA. Chapter 12

DNA and RNA. Chapter 12 DNA and RNA Chapter 12 Warm Up Exercise Test Corrections Make sure to indicate your new answer and provide an explanation for why this is the correct answer. Do this with a red pen in the margins of your

More information

Whole Transcriptome Analysis of Illumina RNA- Seq Data. Ryan Peters Field Application Specialist

Whole Transcriptome Analysis of Illumina RNA- Seq Data. Ryan Peters Field Application Specialist Whole Transcriptome Analysis of Illumina RNA- Seq Data Ryan Peters Field Application Specialist Partek GS in your NGS Pipeline Your Start-to-Finish Solution for Analysis of Next Generation Sequencing Data

More information

Protein Synthesis

Protein Synthesis HEBISD Student Expectations: Identify that RNA Is a nucleic acid with a single strand of nucleotides Contains the 5-carbon sugar ribose Contains the nitrogen bases A, G, C and U instead of T. The U is

More information