Chapter 4 METHOD FOR PREDICTING GENES IN EUKARYOTIC GENOMES

Size: px
Start display at page:

Download "Chapter 4 METHOD FOR PREDICTING GENES IN EUKARYOTIC GENOMES"

Transcription

1 qene Prediction Chapter 4 METHOD FOR PREDICTING GENES IN EUKARYOTIC GENOMES 4.1 INTRODUCTION Evaluation of seven ab initio methods that were evaluated on a non-homologous mammalian data set (Rogic et al. 2001) suggest that among the evaluated programs, Genscan (Burge and Karlin 1997) and HMMgene (Krogh 1997) were able to predict the precise locations of 70% - 80% of the coding exons with low false positives (Rogic et al ). But the inability of these programs to predict complete gene structure for each gene sequence has motivated researchers to investigate the benefits of combining predictions from two or more programs (Rogic et al. 2002; Murakami and Takagi 1998). These methods increase the accuracy of exon prediction by 5%- 10% (Rogic et al. 2002). However, more improvement is needed for these programs to be reliable in each single instance. The similarity search programs, such as BLASTX and Sim4 are very effective in improving the accuracy of gene prediction (Gish and States 1993; Florea et al. 1998). Similarities with three different types of sequences-proteins, cdnaiests transcripts and genomic DNA-can provide information about exon/intron locations (Mathe et al. 2002). Similarity search using programs like BLASTX detect similarities between genomic DNA and database sequences. A previous study suggests that more than 50% of newly sequenced vertebrate genes have similar sequences in the sequence databases (Claverie et al. 1997). However, these programs indicate only the approximate locations of many coding exons and does not identify every exon and do not accurately delineate exon boundaries (Gish and States 1993). Similarity can often be misleading as sometimes conservation in regions may only be in some part of coding exons or the similarity may extend to the introns and/or the UTRs regions of genes. It is worth noting that genes in the newly sequenced genomes are annotated based on similarity to known sequences (Lander 48

2 (}ene Precfiction et al. 2001; Waterston et al. 2002). Some of the important steps in eukaryotic gene identification include, locating the protein-coding regions, identification of start and stop signals, locating the exon-intron splice sites (donor sites) and intron-exon splice sites (acceptor sites). Here, an attempt is made to develop a consensual gene-prediction method that integrates multiple sources of evidences from ab initio predictions, similarity search approaches against available proteins and intron sequences and evidences from splice site predictors. 4.2METHODS Eukaryotic gene data sets Two different data sets were used for evaluation of the existing gene finders and the method developed during the course of present work. These are the HMR195 data set containing 195 human, mouse and rat sequences with 43 single exon genes and 152 multi-exon genes and the Burset/Guig6 data set containing 570 vertebrate sequences containing only multi-exon genes. The details about both the data sets can be obtained from Section Ab initio gene-finding programs In the present study, five ab initio programs were used based on the evaluation of their performance on a non-homologous mammalian gene data set (Rogic et al. 2002). Of these two, Genscan and HMMgene are separate methods while other three are combination methods that rely on predictions from these two ab initio programs. A brief description of the programs is given below Genscan (Burge 1997; Burge and Karlin 1997) In this program, the structure of the genomic sequence is modeled by explicit state duration HMM; the states of which are probabilistic models themselves. In this method the signals sites are predicted using the weight matrices, weight arrays, and the maximal dependence decomposition (Burge 1997). Genscan model can predict the absence of genes or the presence of a single gene or multiple genes, which can be either complete or partial. It also has the option to predict suboptimal exons, which are defined as potential exons with a probability higher than a certain threshold but which are not contained in the optimal parse of the sequence. This type of exon can potentially represent alternatively 49

3 qene Prediction spliced exons. Genscan was trained on Kulp and Reese's data set (1996) of human genomic sequences, and an additional set of 1999 human edna sequences was used for training the coding region HMM. The maximal length of the input sequence for this version of Gens can is 200 kb. The output of Gens can is similar to the output of the other programs, giving information about exon location and their probabilistic score, but scores for other sequence features such as splicing sites are also given. A latest version 1.0 of the Gens can program was obtained from the developers and installed on a local machine HMMgene (Krogh 1997) The program is based on HMM, and is trained using a criterion called conditional maximum likelihood, which maximizes the probability of correct prediction. If the sequence analyzed already has some sub regions identified (hits to EST or protein database, repeated elements), those regions can be locked as coding or noncoding and then submitted to HMMgene. The underlying gene structure model can predict both partial and complete genes in sequence and any number of them. The program has the option to give more than one prediction, which could indicate alternative splicing of the gene in the sequence. The data set of human single- and multi-exon genes collected by Kulp and Reese was used for the training of this program. The output is given in OFF format, slightly different from that used by Genie; it does not give the location of the splicing sites, but only of the exons, whose type is also specified. A latest version 1.1 d of the HMMgene program was obtained from the developers and installed on a local machine Exon Union-Intersection (EUI) method (Rogic et al. 2002) a. Consider all Genscan and HMMgene exons that have exon probability score greater or equal to a threshold Pth The regions predicted by at least one of the programs are labeled as EUI exons (Exon Union) b. Consider all Genscan and HMMgene exons that have exon probability score less than threshold Pth The regions predicted both the programs are labeled as EUI exons (Exon Intersection) Consequently, a Genscan or HMMgene exon that does not overlap any exon predicted by the other program will be accepted if its exon probability is greater or equal 50

4 (}ene 'ediction to Pth or refused otherwise. There is one exception to (a): if Gens can internal exon has the same right boundary (donor site) as the HMMgene initial ex on and both the exons have score greater than Pth, then HMMgene initial exon is chosen as EUI exon. This reflects the greater accuracy ofhmmgene in predicting initial exons (Rogic et al. 2002) Exon Union-Intersection with Reading Frame consistency (EUI Frame) method (Rogic et al. 2002) a. For each program's prediction determine the gene boundaries and to each gene assign a gene probability calculated as the average of exon probability scores for all the exons contained in that gene. For each predicted exon determine the position of acceptor and donor site in a reading frame of a gene it belongs to. b. If the gene predicted by the Genscan overlaps the gene predicted by HMMgene, choose the one with the higher gene probability to impose the reading frame. Apply the EUI method to the exons belonging to the selected genes accepting EUI exons only if they are in the chosen reading frame Gene Intersection (GI) method (Rogic et al. 2002) The GI-Frame approach is designed to identify genes in long genomic regions so as to eliminate false predictions typical of ab initio predictors in long genomic regions. a. For each program's prediction select regions predicted as genes. Regions predicted by both the programs are labeled as GI genes (Gene Intersection). b. Apply the EUI method to those exons that completely belong to the GI genes (here both exon boundaries are within a GI gene). Scripts implementing the EUI, EUI- Frame and GI methods were obtained from the web site of the developers at BLAST similarity search algorithm Similarity search against nucleotide and protein sequence databases constitute one of the important methods for identification and assigning of function for genes in genome 51

5 qene C 7ediction annotation pipelines like ENSEMBL (Curwen et al. 2004). BLAST is the most popular and regularly used method for similarity searches of sequences against databases (Altschul et al. 1990, 1997). One of the component programs of the BLAST tools, the BLASTX program has been successfully employed to identify likely protein coding sequences (Gish and States 1993; Yeh et al. 2001). BLASTX allows protein-protein comparisons when uncharacterized nucleotide query sequence is available. The program conceptually translates query sequences in all six reading frames (three on each strand) and compares each of these full-length translation products with a comprehensive protein sequence database in a single pass. BLASTX offers distinct advantages over related tools such as BLASTP, by combining the conceptual translation and database search procedures into a single step and by collating the results from searches with all six reading frames into a single report (Gish and States 1993). In the present study, BLASTX is used on query nucleotide sequence in a single pass on only the forward strand (three frames only). This is done to confirm with the predictions obtained from ab initio programs that are also obtained only for the forward strand Protein and nucleotide sequence databases In this study RefSeq (Pruitt and Maglott 2001) protein database was used for BLASTX search. Initially SWISS-PROT database (Boeckmann et al. 2003) was also considered but was not taken for extensive analysis since a truly representative protein database was required for this study (see Appendix II). RefSeq database provides a comprehensive, integrated, non-redundant set of sequences, including genomic DNA, transcript (RNA), and protein products, for major research organisms. RefSeq protein database includes only representative sequences from the following organisms: Drosophila me/anogaster, Danio rerio, Homo sapiens, Mus musculus, Rattus norvegicus and Sachharomyces cerevisiae. All the RefSeq proteins were obtained from ftp://ftp.ncbi.nih.gov/refseq/. To remove any unfair advantage to similarity based method over ab initio methods, all the RefSeq protein sequences that are coded by HMR195 data set sequences were removed from the two databases. Intron database (Sakharkar et al. 2002) was obtained for detecting potential intron regions that are then used for removing wrongly predicted exons. Intron database contains intron sequences derived from experimentally validated eukaryotic protein coding genes 52

6 qene Prediction recovered from GenBank eukaryotic entries. Intron sequences that were included in the HMR195 data set genes were removed from the Intron database to prevent unfair advantage to the similarity methods over ab initio methods during comparison NNSPLICE based splice site predictions In this step attempt has been made to detect position of start/stop codon and splice sites (e.g. intron-exon and exon-intron). The web-based NNSPLICE (Reese et al. 1997) program is used for detecting the splice sites in a region +/- 50 bp upstream and downstream of probable exon ends ( NNSPLICE employs separate feedforward neural network with one layer of hidden units to recognize acceptor and donor sites (Reese et al. 1997) Predictive accuracy measures The accuracy measures for evaluating the different methods used in this study have been previously reported by Burset and Guig6 (1996) and Rogic et al. (2001). Modifications were made in the measures to perform averaging over all sequences in the data set instead of averaging only over those sequences for which predictions were obtained. This prevents overestimation of specific accuracy measures. All the predictive accuracy measures at nucleotide and exon level were calculated as given below: Nucleotide level accuracy True positives (TP) are defined as the number of coding nucleotides predicted as coding; true negatives (TN) are defined as the number of non coding nucleotides predicted as noncoding, false positives (FP) are defined as the number of noncoding nucleotides predicted as coding, and false negatives (FN) are defined as the number of coding nucleotides predicted as noncoding. Accordingly, SEN (sensitivity)= TP, and SPE (specificity)= TP ~+ffl ~+H These are widely used measures of accuracy for gene prediction programs. Two measures that capture both specificity and sensitivity are the correlation coefficient ( CC) and approximate correlation (AC). However, AC has the advantage over CC since the 53

7 qene Prediction latter is not defined in cases where the subject sequence lacks either coding regions or noncoding regions. AC is defined by Burset and Guig6 (1996) as, Ac-({1( TP TP TN TN )} osj*2-4 TP+FN + TP+FP +TN +FP +TN +FN ( 4.1) Exon Level Accuracy Exon level sensitivity (ESEN) is defined as the proportion of actual exons that are accurately predicted as exon, while exon level specificity (ESPE) is defined as the proportion of predicted exons that are accurately predicted. Average of ex on level accuracy (EA VG) is defined as average of exons level sensitivity and exon level specificity. Usually this is used as a reliable measure of a program's exon level accuracy. To get a better estimate of prediction accuracy of the analyzed programs, we also computed the number of correct exons (CR), wrong exons (WE), missed exons (ME), partially correct exons (PC) and overlapping exons (OL) where, CR represents those predicted exons that are correctly predicted, WE are the noncoding regions predicted as exons, ME are the exons predicted as noncoding regions, PC are those predicted exons that have at least one end of exon correctly predicted, and OL are predicted exons that overlap actual exons. 4.3 RESULTS AND DISCUSSION Evaluation of existing programs Evaluation of the existing gene finders was carried on two different data sets, HMR195 and Burset/Guig6 data sets. The evaluation was performed on four different groups of programs including, ab initio methods, methods that combine ab initio algorithms, similarity approach alone and methods that combine predictions from ab initio and similarity search approaches. Each of these experiments is described below Ab initio gene finders The performance of the two ab initio methods Genscan and HMMgene programs on HMR195 and Burset/Guig6 data sets are shown in Table 4.1. The accuracy of the programs against each sequence in the data set is computed for each sequence in both the data sets and then averaged over the total number of sequences in the data set (see 54

8 qene Prediction methods). The performance of the ab initio programs achieved in the present evaluation are lower than that achieved by Rogic et al. (2001) for the same data sets since the performance in previous evaluation was achieved by averaging over only those sequences in which predictions were made by the ab initio programs. Interestingly, the performance achieved by the predictors on HMR195 sequences are lower than that achieved on Burset/Guig6 data set. This may be because of the following reason. The HMR195 sequences are recently available sequences while the Burset/Guig6 data set has 'older' sequences. Since it can be reasonably presumed that programs re-train their prediction models using available sequences to achieve better prediction results, the accuracy achieved by the programs on HMR195 sequences are bound to be lower than that achieved on Burset/Guig6 data set. Moreover, the programs studied here were developed after the generation of the Burset/Guig6 data set and therefore their training sequences may have some common sequences with the particular data set thereby leading to better performance. Though the performance should have been better in HMR195 data set due to the fact that sequences were derived from related organisms namely, humans, mouse and rat. This hypothesis should have held true since the Genscan and HMMgene programs were developed to predict human protein-coding genes and are expected to give better performance on sequences from humans and related organisms. In contrary, the sequences in the Burset/Guig6 data set were derived from all the vertebrate organisms whose sequences were available in GenBank database. However, result in Table 4.1 suggests the reverse, where performance on HMR195 data set is lower for both organisms, particularly at the exon level. This might suggest that the programs may have trained on related sequences that were available at the time of development but perform less efficiently on sequences that are newly sequenced, even if they are from organisms that they are specific for. 55

9 qene Prediction Table 4.1: Performance of ab initio programs on HMR195 & BursetJGuigo data sets PROGRAM No Nucleotide Level Exon Level Genes SEN SPE AC cc CR PC OL ME WE ESEN ESPE EAVG HMR195 GENSCAN HMMgene Burset/Guig6 GENSCAN HMMgene Only the forward (+) strand exons from default output of programs tested were compared to GenBank annotated exons for each sequence. The standard measures of predictive accuracy were averaged over all sequences in the data set: SEN, nucleotide level sensitivity; SPE, nucleotide level specificity; AC, approximate correlation; CC, correlation coefficient; ESEN, exon level sensitivity; ESPE, exon level specificity; EAVG, (ESEN+ESPE)/2; ME, number of missed real exons; WE, number of predicted wrong exons; CR, number of correctly predicted exons that are correct at both ends; PC, number of predicted exons that are partially correct; OL, number of predicted exons overlapping actual exons but end are incorrect; No Genes, number of sequences where no predictions were made by the programs. 56

10 qene Prediction Table 4.2: Performance of ab initio based combinatorial gene finding programs PROGRAM No Nucleotide Level Exon Level Genes SEN SPE AC cc CR PC OL ME WE ESEN ESPE EAVG HMR195 EUI EUI-FRAME Gl Burset/Guig6 EUI EUI-FRAME Gl Only the forward (+) strand exons from default output of programs tested were compared to GenBank annotated exons for each sequence. The standard measures of predictive accuracy were averaged over all sequences in the data set: SEN, nucleotide level sensitivity; SPE, nucleotide level specificity; AC, approximate correlation; CC, correlation coefficient; ESEN, exon level sensitivity; ESPE, exon level specificity; EAVG, (ESEN+ESPE)/2; ME, number of missed real exons; WE, number of predicted wrong exons; CR, number of correctly predicted exons that are correct at both ends; PC, number of predicted exons that are partially correct; OL, number of predicted exons overlapping actual exons but end are incorrect; No Genes, number of sequences where no predictions were made by the programs. 57

11 qene Prediction Ab initio based combinatorial methods In the next step, the performance of the ab initio based combinatorial methods was computed on HMR195 and Burset/Guig6 data sets. The combinatorial methods used here were the EUI, EUI-Frame and GI methods (see methods). The results are shown in Table 4.2. Similar to the results in Table 4.1, the combinatorial ab initio methods achieve a better performance on Burset/Guig6 data set at the exon level. Overall, the combinationbased ab initio methods predict a very low number of wrong exons (WE) while there is an increase in the number of missed exons (ME). It is observed that most of the partially correct exons (PC-exons whose one end is correctly predicted) and overlapping exons (OL-exons whose both ends are incorrect but overlap real exons) are converted to correct exons (CR) BLASTX based similarity search method An important study in development of any similarity based combination method is to compute the performance achieved by the similarity-based methods alone. BLASTX search against protein sequences constitute one of the important method to locate proteincoding regions in genomic DNA. This is more important due to the fact that the method is highly dependent on the data set that is used for similarity search. Use of most recent databases therefore would increase the performance of these methods considerably. The performance of the BLASTX on similarity search of HMR195 and Burset/Guig6 data sets against the representative database, RefSeq, was therefore calculated. Query genomic DNA sequence is searched against Ref Seq sequence database using BLASTX program. The expectation value (E-value) is kept at 1 to derive all probable strong and weak hits. Word length= 2 is chosen to make an extensive search and '-1' option is used to derive GI numbers of probable BLASTX hits from the database. Predictions are obtained only for the forward strand ( + ). Default parameters were selected for other options during BLASTX search (e.g., BLOSUM62). The results ofthe evaluation ofblastx algorithm on HMR195 and Burset/Guig6 data sets are shown in Table 4.3. The exon level accuracy achieved during the present evaluation is comparable to that achieved by Guig6 et al. (2000) on using default BLASTX parameters. The nucleotide level accuracy of the previous evaluation is higher 58

12 qene CFrediction than that achieved here due to the use of non-redundant (nr) protein database from National Center for Biotechnology Information (NCBI). While the result in Table 4.3 suggests poor exon level predictive ability of the similarity search algorithm, it would be prudent to remember that BLASTX not a gene finder per se. The default BLASTX strategy however produces reasonably high sensitivity (0.88 for HMR195 data set and 0.92 for Burset/Guig6 data set) at the nucleotide level to suggest that incorporation of the similarity search method at some level in other ab initio based methods would improve the accuracy of gene prediction. This has been the basis for many attempts in the past to integrate evidences from similarity searches against various databases into the effort to accurately predict protein-coding genes Method based on combination of ab initio and similarity based approaches In the next step, evaluation of GenomeScan, which integrates the predictions from BLASTX into the predictions from Genscan method (see section on Related Works in Chapter 2), was carried out. The GenomeScan is provided with the same set of top protein hits against Ref Seq database that is obtained in the present study on using BLASTX method. The results ofthe evaluation are shown in Table 4.4. On the HMR195 data set it achieves exon level sensitivity {ESEN) of 0.78, exon level specificity (ESPE) of 0.71 and average (EAVG) of Similarly, on the Burset/Guig6 data set, the performance achieved by GenomeScan for ESEN, ESPE and EAVG are 0.81, 0.78 and 0.80 respectively. The performance of Genome Scan on all the parameters at exon level is better than that achieved by Genscan program (Table 4.1). For HMR195 data set, the SEN (0.96) is higher for GenomeScan than that achieved by Genscan alone (0.93) or that achieved by ab initio based combination methods like EUI (0.92), EUI- Frame (0.92), and GI (0.84). However, the SPE (0.87) is lower for GenomeScan than other methods suggesting high number of false predictions. Lower number of missed genes ( 1) in case of Genome Scan in contrast to 3, 3, 3, and 5 by Gens can, EUI, EUI- Frame and GI respectively, suggest an increase in SEN at the cost of SPE. 59

13 qene Prediction Table 4.3: Performance ofblastx method on HMR195 & Burset/Guig6 data sets No Nucleotide Level Exon Level PROGRAM Genes SEN SPE AC cc CR PC OL ME WE ESEN ESPE EAVG HMR Burset/Guig Only the forward(+) strand exons from default output of programs tested were compared to GenBank annotated exons for each sequence. Table 4.4: Performance ofgenomescan method on HMR195 & Burset/Guigo data sets No Nucleotide Level Exon Level PROGRAM Genes SEN SPE AC cc CR PC OL ME WE ESEN ESPE EAVG HMR Burset/Guig The GenomeScan method combines evidences from BLASTX approach with predictions from ab initio method, Genscan. 60

14 qene Prediction Similar trend is observed in evaluation on Burset/Guig6 data set except that GenomeScan misses more genes (12) in comparison to Genscan program in contrast to that observed in case ofhmr195 data set (Table 4.1 and Table 4.2). These conflicting results may be due to limitations in the integration method employed by the GenomeScan method. These results suggested that further improvements could be made to the combination-based method for deriving more accurate predictions Development of a combinatorial method to predict eukaryotic genes Based on the advances in the efforts to accurately predict eukaryotic proteincoding genes and the evaluation of the most popular methods to identify genes in genomic DNA, an attempt is made in this thesis to improve upon the existing gene identification methods. The present work derives from the accuracy achieved by the ab initio methods and integrates evidence from the BLASTX based similarity search approach. The work further investigates the effect on accuracy of prediction after integration of evidences from similarity to intron sequences and predictions from accurate splice site predictors. This combinatorial method is developed into a highly integrated strategy called EGPred that effectively combines the multiple sources of evidences for accurate gene prediction A double BLASTX approach to identify probable homologous sequences Use of top BLASTX hits as a reference set to search for similar regions in a query genomic sequence has been effectively employed in the past (Guig6 et al. 2000; Yeh et al ). An attempt is made here to develop this strategy further by integrating it with other components of gene prediction. In this all proteins identified in the first BLASTX are extracted. A second BLASTX search of the genomic sequence against these limited protein sequences with relaxed parameters (E-value < 10) is used to retrieve all probable coding exon regions without masking low complexity regions. -l' option is used to provide the BLASTX program the list file containing the gene index (GI) numbers of all hits from the initial search. Similar to the first run for BLASTX, predictions are obtained for the forward(+) strand only. 61

15 qene Prediction To derive probable coding exons from BLASTX report we used the following steps; (1) all high-scoring segment pairs (HSPs) without any inframe stop codons or 4 consecutive gaps are considered as exons; (2) HSPs that have inframe stop codon are split at stop codon and the longest segment is considered as the ex on; (3) HSPs with 4 consecutive gaps are split at the gap and the longest segment is considered as ex on; ( 4) multiple overlapping BLASTX hits are pruned in a preprocessing step to keep only the strongest hit (lowest E-value). For determining the start and stop signals, the position of HSPs that are near then- or C- terminals when aligned along length of database protein sequence is used as template to predict the positions of these signals. To compute the significance of probable exons assigned above, we derive four different exon parameters-the length of exon; sequence identity/similarity; score; and E-value of the HSP. The sequence identity/similarity are recalculated if any exon is a sub -region derived from an HSP. The performance of the second BLASTX step is evaluated on HMR195 and Burset/Guig6 data sets. In the HMR195 data set, the second BLASTX increases the number of correctly (CR) predicted exons from 36 to 75 (Table 4.5), and the number of partially correct exons (PC) increases from 277 to 317. This also reduces the number of wrongly predicted (WE) exons from 1920 to 962. In contrary, the second BLASTX reduces the number of overlapping (OL) exons from 643 to 453, and increases the number of missed (ME) exons from 98 to 117. There is significant improvement in sensitivity (ESEN, from 0.04 to 0.18) and specificity (ESPE, 0.02 to 0.12) at the exon level. A similar trend was observed in the Burset/Guig6 data set, where the results achieved validate the consistency of rules derived on the HMR195 data set to be generalized and therefore equally effectively on independent data sets. The second BLASTX increases the CRs from 123 to 168 and PCs from 746 to 1050 (Table 4.5). The number of WEs is reduced from 5697 to However, OLs are reduced from 1737 to 1070, and MEs increase from 246 to 380. The EAVG at the exon level increases from 0.03 to This increase is significantly lower than that achieved on HMR195 data set where EAVG increases from 0.03 to A probable reason behind this might be the 62

16 qene C'Prediction representative sequence database, RefSeq, does not contain similar sequences from related organisms whose sequences are derived for Burset/Guig6 data set Effect of integrating evidences from the splice site predictions on performance of the BLASTX approach In this step, an attempt is made to detect the positions of start/stop codon and splice sites (e.g. intron-exon and exon-intron). The Web-based NNSPLICE program was used for detecting the splice sites in a region +/- 50 bp upstream and downstream of probable exon ends ( NNSPLICE employs separate feedforward neural networks (FFNN) with one layer of hidden units to recognize acceptor and donor sites (Reese et al. 1997). Positions other than the original site and having more probability of being a splice site, depending on the score from the neural network, are assigned as the correct splice site for that category (i.e. acceptor/donor). In cases of no score from the network, the original positions are assumed to be the correct sites. After reassigning the splice sites a filtration of exons from similarity method is done to remove candidate exons that are less than 30% in length of the aligned region of database protein sequence. A marginal effect is observed on the nucleotide level accuracy for both data sets on incorporating splice site information to the BLASTX exons (Fig. 4.1a). Modifications of the splice site positions using the NNSPLICE program significantly increase the ESEN from 0.18 achieved after second BLASTX to 0.59 in case ofhmr195 data set. Increase in ESPE (from 0.12 to 0.40) and EAVG (from 0.15 to 049) is also observed (Fig. 4.1 b). A similar trend was observed in Burset/Guig6 data set, where the ESEN increased from 0.06 to 0.62, ESPE from 0.04 to 0.48 and EVG from 0.05 to The 'improved' exons now provide a better set of candidate transcripts for the combinatorial experiment Effect of integrating evidences from BLASTN against intron sequence database on performance of the BLASTX approach A BLASTN search of genomic sequences against the intron database is run to identify potential intron regions in the query sequence. A low E-value < 1 o-so and a very stringent word length 15, are used to obtain more reliable intron hits to query sequence. Region of low complexity are masked during BLASTN searches. After the search, only introns above 60 bp in length (average minimum intron length) are considered. All the 63

17 qene Prediction probable exons and introns obtained from similarity searches were compared in order to identify overlapping exons and intron predictions. Different thresholds were studied for filtering exons using introns. Based on the data, all exons having 40% or more overlapping length with introns were removed from the list of probable exons. Boundaries of exons were reassigned if (1) only their terminal ends have overlapping introns, and (2) the overlap covers only 5% to 40% of exon length. Overlapping introns of below 5% of particular exon length are not considered since exon-intron/intron-exon boundaries are incorrectly defined by similarity methods HMR195 data set c=j Burset/Guigo ,I BLAST X (2nd cycle) I BLAST X + NNSPLICE BLASTX + INTRON BLASTX + INTRON + NNSPUCE Figure 4.1a: The performance of BLASTX method at nucleotide level on integrating intron similarity and splice site information. The black bars represent the performance on the HMR195 data set and white bars represent the performance achieved on the Burset/Guig6 data set. The first column in each set of black and white bars represents the Sensitivity (SEN) at nucleotide level; the second bar is the Specificity (SPE); third bar is the Approximate Correlation (AC); and forth bar is the Correlation Coefficient (CC) 64

18 qene c.prediction Incorporation of intron information reduces WE from 962 to 380, whereas ME is increased from 117 to 124 in case of HMR195 data set (Table 4.5). At the nucleotide level the SPE is increased from 0.72 to Similar trend is observed in case of Burset/Guig6 data set (Fig. 4.1 a). Figure 4.1b: The performance of BLASTX method at exon level on integrating intron similarity and splice site information. The solid line represents the performance on the HMR195 data set and dashed line represents the performance achieved on the Burset/Guig6 data set. EAVG is the average of exon level sensitivity (ESEN) and exon level specificity (ESPE). The improvement in EAVG on integrating intron information is more pronounced on integration along with splice site information. 65

19 q(me Prediction Table 4.5: Performance of BLASTX method on integrating intron similarity and splice site information PROGRAM HMR195 data set No Exon Level Genes CR I PC I OL I ME I WE BLASTX (2"d cycle) BLASTX+NNSPLICE BLASTX+INTRON BLASTX+INTRON+NNSPLICE Burset/Guig6 data set BLASTX (2"d cycle) BLASTX+NNSPLICE BLASTX+INTRON BLASTX+INTRON+NNSPLICE

20 qene Prediction Effect of integrating evidences from both splice site predictions and similarity to intron sequences This procedure was carried out in two steps. First the BLASTX predicted exons were filtered on similarity to intron sequences (as is employed in the section ). In the second step, the filtered exons are subjected to splice site predictions using the NNSPLICE program (as in section ). Evaluation of the integration method was carried out HMR195 and Burset/Guig6 data set. The results from this evaluation are given in Table 4.5. The incorporation of intron similarity information and splice site prediction significantly improves the ESPE (from 0.40 to 0.56) and EAVG (from 0.49 to 0.57) in case ofhmr195 data set. In case ofburset/guig6 data set, the improvement for ESPE is from 0.48 to 0.67 and for EA VG is from 0.55 to However, the change in accuracy for ESEN is marginal for both HRM195 data set (from 0.59 to 0.58) and Burset/Guig6 data set (from 0.62 to 0.64). At the nucleotide level, a large increase in the SPE is also observed for both data sets, though the number of missed genes increases in case of Burset/Guig6 data set from 0 to Combination of ab initio and similarity-based approaches For this study, the following ab initio programs were used---(1) Genscan, (2) HMMgene, (3) Exon Union-Intersection (EUI), (4) Exon Union-Intersection with Reading Frame consistency (EUI- Frame) and (5) Gene Intersection (GI) methods. All programs were used for prediction with the default parameters as derived by respective developers. Combination procedure The exons predicted from both ab initio methods and similarity search approach are divided into five groups. The first group contains all exons that are predicted exactly same by both approaches and are considered true exons. The second group contains exons from both methods where only the 5' end matches. The positions from ab initio predictions are replaced with positions from similarity predictions at 3' end on following conditions; (1) the ab initio methods predicts the exon as a 'last exon' type and similarity method predicts an overlapping 'internal exon' with donor site. The identity of exon from similarity-based method should be above 90% to database sequence and score of NNSPLICE prediction should be above 0.5 to allow the replacement of3' end; (2) the ab initio method predicts a 'single exon' type and similarity method predicts an 'initial 67

21 (iene Predlction exon' type with donor site. The 3' end is replaced if identity of similarity derived exon is more than 95% to database sequence and the score of donor site from NNSPLICE is above 0.9; (3) The ab initio method predicts a 'initial exon' type and the similarity method predicts a 'single exon' type with very high significance of E-value below 1 o- 200 ; (4) both the ab initio and similarity-based methods predict an 'internal exon' type and the NNSPLICE predicts a donor site with score above 0.5. The third group contains exons from both methods where only the 3' end matches. The position of 5' end from ab initio predictions is replaced on following conditions; (1) both the ab initio and similarity-based methods predict an 'initial exon' type and very high significance of E-value below 1 o- 50 ; (2) both the approached predict a 'single exon' type and exon from similarity method has very high significance of E-value < 1 o- 200 ; (3) both the approaches predict an 'internal exon' type and NNSPLICE predicts a acceptor site with score above 0.5; (4) the ab initio methods predicts an 'internal exon' type and similarity method predicts an 'initial exon' type with identity above 90% with database sequence; (5) the ab initio method predicts a 'single exon' type and the similarity-based method predicts an 'internal exon' type with high significance of E-value < and NNSPLICE predicts an acceptor site with score above 0.5. The fourth group contains exons from both methods that overlap each other with both ends being dissimilar. The positions of 5' ends of such exons are replaced with positions from similarity-based method on following conditions; ( 1) the ab initio method predicts an 'initial exon' and similarity approach predicts a 'single exon' type with more than 90% identity; (2) the ab initio method predicts a 'single exon' type and similarity approach predicts an 'internal exon' with high significance of E-value <10-50 and NNSPLICE predicts a acceptor site with score above 0.5; (3) both approaches predict 'internal exon' type an NNSPLICE predicts acceptor site with score above 0.5. The positions at 3' ends of overlapping exons are replaced with positions from similarity approach on following conditions; (1) both approaches predict 'internal exon' type and NNSPLICE predicts a donor site with score above 0.5; (2) both approaches predict 'initial exon' with NNSPLICE predicting a donor site with score above 0.5. The fifth group contains exons predicted only by similarity method (E-value< 1 o- 50 ) or by ab initio program (probability >0.5) alone are considered as true exons. For EUI, EUI- Frame and GI methods the probability score is not considered, as they are combination methods. 68

22 ()ene Predlction Rules for merging conflicting evidences Primarily, the predictions from the ab initio programs are used as template on which 'corrections' are to be made based on the level of confidence with which predictions are obtained from the similarity-based approach. Two important parameters from BLASTX that are considered valuable for determining the confidence of predictions are the Expectation-value (E-value) and the percent identity (PID) of aligned query to database sequence. Rules were derived for determining the threshold for using each of these two parameters on four different exon types--initial exons, internal exons, terminal exons and single exons. Steps for deriving these rules are as follows. (1) From each of two approaches (ab initio and similarity-based) predictions were obtained for the HMR195 data set. (2) The predicted exons from the ab initio (ab) programs and the similarity-based (sim) approach were grouped into four based on the exon type predicted -initial exons (I), internal exons (R), terminal exons (T) and single exons (S). (3) In each group, for every ab initio predicted exon, the exon type of the corresponding prediction from similarity-based approach is selected and categorized into sixteen sub-groups-- Iab!Isim, IaJRsim, IaJT sim, IawSsim, RaJisim, RaJRsim, RaJT sim, RaJSsim, T awisim, T ajrsim, TaJTsim, TaJSsim, SaJisim, SaJRsim, SaJTsim, SaJSsim (4) In each sub group, thresholds for E-value and PID were separately derived. (5) For deriving the thresholds in case of NNSPLICE program following procedure was used. In each sub group where the BLASTX predicted exons are not of the (S) type, the effect of increase in score from 0 to 1 is studied on accuracy of combination method. For each sub group, a separate threshold is obtained. In most sub-groups, a optimal threshold of 0.5 is obtained for consideration in the combination step except in case the BLASTX predicts an (I) type while the ab initio program predicts a (S) type where the optimal score for NNSPLICE is observed as 0.9 for affecting the combination. Evaluation of the combinatorial method The final combinatorial method is evaluated on HMR195 and Burset/Guig6 data sets. The evaluation is affected at two different phases. One, when only BLASTX derived exons are combined with ab initio predictions and secondly, when both BLASTX derived exon information and BLASTN derived intron information is used for effecting the combination with ab initio prediction. In both cases, NNSPLICE is used to modify splice sites of BLASTX exons prior to the combination procedure. 69

23 qene CJlrediction The results from the evaluation of the final combination procedure on the HMR195 and Burset/Guig6 data set are shown in Table 4.6a. On the HMR195 data set, the ESEN for Genscan increases from 0.70 to 0.81, the ESPE increases from 0.69 to 0.73, and the EAVG increases from 0.70 to 0.77 on incorporating similarity information against the protein database. ESPE further increases from 0.73 to 0.74 when similarity information from intron and splice sites is incorporated (Table 4.6b ). The improvement is at both the nucleotide and exon levels. It was interesting to note that there were no missing genes when similarity information is included, whereas original Genscan missed three genes. Similarly, no genes were missed by ab initio methods on inclusion of similarity information, unlike the original programs, where the numbers of missed genes were five, three, three, and 15 for the HMMgene, EUI, EUI-Frame, GI methods respectively. On inclusion of similarity information from proteins, the ESEN for HMMgene increases from 0.74 to 0.80 and ESPE increases from 0.75 to ESPE further increases to 0.78 on including similarity information from intron and splice site predictions. EAVG for HMMgene also increases from 0.75 to 0.80 on incorporating complete information from similarity searches and NNSPLICE predictions. A similar trend was observed for other programs EUI, EUI -Frame and GI where performance of all methods improves significantly at exon as well as at nucleotide level. On the Burset/Guig6 data set, the performance of Genscan on inclusion of similarity information from proteins improved significantly, with ESEN increasing from 0.78 to 0.84, ESPE increasing from 0.79 to 0.80 and EAVG increasing from 0.79 to 0.82 (Table 4.6a). The performance was further improved when information from splice site and intron search was included: ESPE increased to 0.82 and EA VG increased to 0.83 (Table 4.6b ). A similar trend was observed for HMMgene, where ESEN increased from 0.76 to 0.83, ESPE increased from 0.77 to 0.83 and EAVG increased from 0.77 to 0.83 when similarity information from proteins was integrated. As shown in Tables 4.6a and 4.6b, the performance of all other methods improved significantly particularly at exon level, when similarity information from proteins and introns were incorporated as described in this study. 70

24 qene Prediction Table 4.6a: Performance of ab initio programs on adding protein similarity information* PROGRAM HMR195 Data set No Nucleotide Level Exon Level Genes SEN I SPE I AC I cc CR I PC I OL I ME I WE I ESEN I ESPE I EAVG GENSCAN+BLASTX HMMgene+BLASTX EUI+BLASTX EUI-FRAME+BLASTX GI+BLASTX Burset/Guig6 Data set GENSCAN+BLASTX HMMgene+BLASTX EUI+BLASTX EUI-FRAME+BLASTX GI+BLASTX * BLASTX predicted exons were modified using NNSPLICE information prior to combination 71

25 qene Prediction Table 4.6b: Performance of ab initio programs on adding protein and intron similarity information* PROGRAM No Nucleotide Level Exon Level Genes SEN I SPE I AC I cc CR I PC I OL I ME I WE I ESEN I ESPE I EAVG HMR195 Data set GENSCAN+BLASTX+BLASTN HMMgene+BLASTX+BLASTN EUI+BLASTX+BLASTN EUI-FRAME+BLASTX+BLASTN GI+BLASTX+BLASTN Burset/Guig6 Data set GENSCAN+BLASTX+BLASTN HMMgene+BLASTX+BLASTN EUI+BLASTX+BLASTN EUI-FRAME+BLASTX+BLASTN GI+BLASTX+BLASTN * BLASTX predicted exons were modified using NNSPLICE information prior to combination 72

26 C}ene Prediction It has been shown that the performance of ab initio methods can be improved significantly if similarity search information is incorporated (Y eh et al ). In the present study, a systematic attempt was made to improve the performance of different ab initio methods using sequence similarity information. First, we evaluated the performance of sequence similarity approach on independent data sets. Initially, the performance of similarity search approaches was tested on SWISS-PROT and RefSeq databases. The specificity of BLASTX is found to be better on the Ref Seq database (see Appendix II). The reason for this is that RefSeq is a representative database, unlike SWISS-PROT, which is an all-inclusive, nonspecific database. Many false hits are thus avoided by using RefSeq database. However, results shown in Table 4.6a and 4.6b suggest that the present RefSeq database size may not be sufficient for all query sequences, and additional proteins from more organisms are required to make the database complete. In the present study, the concept of detecting introns by searching query sequence against intron database (Sakharkar et al. 2002) using BLASTN is introduced for the first time. These potential introns are compared with probable exons obtained from similarity search in order to filter/remove wrongly predicted exons. Surprisingly, the incorporation of intron information reduces the WE significantly. Intron sequences are thought to have little conservation among them. However, it is also clear that coding exon regions will have very negligible or no similarity with intron sequences. This may be due a variety of factors including codon bias, dinucleotide frequency bias, or hexamer bias etc. This study attempted to use these differences between coding exon and intron regions to filter spurious exons. However, it would be logical that any advantage gained from intron information will be highly dependent on related sequences being present in the intron database. It was observed that the performance of ab initio methods improved at nucleotide level but decreased at exon level when we incorporated exons obtained from similarity search. This is because exons from similarity search approaches do not predict correct exon-intron and intron-exon boundaries. However, it was seen that the performance of similarity-based method (with protein and intron) improved significantly (EA VG increases from 0.16 to 0.57) at exon level when splice sites were predicted using 73

27 C}ene Prediction NNSPLICE program (Table 4.5). Thus, EGPred strategy uses NNSPLICE and BLAST search against protein database and intron database. These results suggest that the increase in gene prediction performance is not uniform for all programs. The accuracy on incorporation of similarity information is dependent more on the accuracy achieved by the ab initio programs themselves (Table 4.6). Overall the accuracy of all ab initio programs improved significantly, particularly at exon level, where performance increases from 4% to 10%. It must be noted that there were no missed genes when similarity information was integrated with these ab initio methods, whereas the original programs missed genes from three to 15 in HMR195 data set. An important observation is that the rules derived for mergmg conflicting evidences are generalized for sequences with one-gene-per-sequence property. The results from Burset/Guig6 data set prove that the rules work effectively on sequences independent of HMR195 data set (Table 4.6b). For deriving maximum benefit on using the rules, an important requirement is that overlapping predictions from multiple sources should be available. Experimentally proven sequences with one-gene-per-sequence like those in HMR195 and Burset/Guig6 data sets have protein sequences in the databases that are similar to their protein products. Moreover, programs like NNSPLICE use available sequence information for training. This would suggest an evident bias of performance of EGPred towards sequences that have similar sequences available in the database Demonstration of the combinatorial method on Human Chromosome 13 The capability of modem gene finding algorithms need to be tested on long DNA sequences in order to cope with huge genomic DNA coming from sequencing projects. It is important to evaluate the performance of newly developed methods in realistic situation where DNA sequence consists multiple genes unlike the one-gene-per-sequence model. In order to demonstrate capability of EGPred, its performance was evaluated on the partial human chromosome 13 that was recently sequenced. Human chromosome 13 is the largest acrocentric human chromosome and is estimated to contain 633 genes with a total of 4266 exons excluding those from the pseudogenes (Dunham et al. 2004). An initial analysis has been performed as described in Methods. A total of bp was 74

28 qene Prediction analyzed in a region from to bp to predict genes. The sequence was obtained from The annotation available in the public domain for the segment of human chromosome 13 analyzed was obtained from /vega. sanger.ac. uk/homo _sapiens/ exportview. Since EGPred is critically dependent on its component program for accurate prediction, it is necessary to use programs that can predict multiple genes in long DNA sequences. Therefore, genes were predicted in chromosome 13 using EGPred methods that use Genscan and HMMgene programs in combination with similarity-based approach, as these ab initio programs are capable of predicting more than one-gene-persequence. EGPred by default predicts genes only in the forward strand. Therefore, the chromosome sequence is re-analyzed for reverse strand by reverse-complementing the entire genomic DNA before analysis using EGPred. Since EGPred method predicts gene only in the forward strand, the entire sequence was reverse-complemented and then used for prediction on reverse strand. The predictions for the reverse strand are computed from the last base towards the first base while predictions for the forward strand implies that position of exons are computed from the first base towards the last base. The annotation available in the public domain was compared against the predictions from the EGPred. The genes were predicted usmg two different combinations implemented in EGPred for Genscan and HMMgene methods. Application of these two strategies on human chromosome 13 produced four sets of putative genes (two for each strand), which is summarized in Table 4.7. A total of2125 multi-exon genes, 406 single exon genes and 2065 partial genes are predicted by the Genscan-based EGPred strategy. HMMgenebased EGPred strategy produced 4000 multi-exon genes, 220 single-exon genes and 2705 partial genes. The average number of exons per gene is also listed in Table 4.7. The total protein-coding region of the predicted exons covers little over 1 % in each direction for both strategies out of the total chromosome segment analyzed (Table 4.7). Out of 10,680 exons predicted by the Genscan-based EGPred strategy 2427 exon match those from the publicly available exon annotation data. Similarly, out of 14,251 exons predicted by HMMgene-based EGPred strategy only 1961 exons match the publicly available exon 75

29 qene (]lretfiction annotation data. Interestingly, all the exons predicted by the similarity-based approach were also predicted by the ab initio approach (Table 4.7). However, this does not suggest that only the ab initio method need to be used for predictions. Many of the ab initio predicted exons are modified based on evidences from similarity-based approach. Moreover, as the amount of known proteins is presently limited and the accuracy will increase further with increase of available data. The large number of predicted partial genes in Table 4.7 from both strategies reflects the fragmentation of genes in the sequenced chromosome segment. This fraction is likely to be decreasing with more availability of sequence data. Some of the gene structures are likely to reflect the alternatively spliced isoforms. However, the mechanisms underlying alternative splicing are not well understood and computational analysis of this phenomenon will require more specialized tools than those reported here. Table 4.8 shows the comparison of EGPred predictions with available public domain annotation of human chromosome 13. Surprisingly, more than 70% of exons predicted by the EGPred are not reported in the annotation. Since EGPred uses similarity to protein sequences, a large fraction of predicted genes are likely to be protein coding. However, results in Table 4.7 suggest that all predictions from similarity-based approach are also predicted by ab initio approach. However, a considerable proportion of genes are estimated to be absent from the current databases therefore the predicted genes may also have potentially novel protein-coding genes. A direct one-to-one comparison of predicted exons from both the EGPred-based strategy reveals that almost all exons predicted by one method are also predicted by the other method (Table 4.9). A large number of overlapping predictions points to the need to develop a more accurate splice site prediction program. 76

30 qene Prediction Table 4. 7: Summary of EGPred predictions on Human Chromosome 13 (1) (2) I (3) Genes I (4) I (5) (6) Mbp G (+) (45.2) (9.1) (45.6) (1.16) Mbp H (+) (57.6) (3.5) (38.9) (1.18) G Mbp (-) (47.3) (8.6) (44.1) (1.01) H Mbp (-) (57.9) (2.9) (39.2) (1.03) I (7) I (8) I Exons (9) I (10) I (11) I (12) 1-Programs; 2-Multi-exons genes; 3-Single-exon genes; 4-Partial genes; 5-Total predicted genes; 6--Total predicted exons; 7-Exons-per gene; 8-Total exon length; 9-Exons predicted by ab initio approach only; 10-Exons predicted by similarity search only; 11-Exons predicted by both approaches; 12-Number of matches to annotated exons; G(+)-Genscan positive strand; G(-) -Genscan negative strand; H(+)-HMMgene positive strand; H(-)-HMMgene negative strand. The numbers in parentheses in column 2-4 denotes percentage of total predicted genes, in column 8 it denotes percent of total analyzed nucleotide sequences of human chromosome 13. Column 12 includes exact, partial and overlapping matches. 77

31 (jene Pretfiction Table 4.8: Comparison ofegpred predictions on human chromosome 13 against available public domain annotation Total Available Both One Overlap Novel Programs predicted annotated ends end match exons exons exons match match G (+) G (-) H (+) H (-) Table 4.9: Comparison ofgenscan- and HMMgene-based EGPred predictions Both ends One end Overlap match match match No match G (+) vs H (+) G (-) vs H (-) H (+) vs G (+) H (-) vs G (-)

32 (jene Precfiction The physical location of all genes predicted by the EGPred method along the human chromosome 13 DNA sequence for both strands are available from While EGPred is demonstrated to be reliable for both short and long DNA sequences, the success of the program is critically dependent on the accuracy of underlying programs and continued improvements in gene prediction algorithms should improve future EGPred results. The reliability of EGPred to predict genes in large DNA sequences is also proved by the demonstration of two different EGPred strategies on a region of ~95Mbp long human chromosome 13. Initial results from the experiment suggest a much more number of genes than that is available in the public domain. This difference may be due to a variety of reasons, including the fact that the manual annotation is the product of all curated predictions based on alignment to all publicly available expressed sequences and application of gene prediction algorithms (Dunham et al. 2004). The manual annotation is therefore a conservative estimate of actual number of genes that may be present in the chromosome 13. While the actual expression of the EGPred predicted genes would be proven only through much experimental work, a fraction of these predicted genes are supported by known protein sequence data and therefore are more likely to be proteincoding genes. However, it is emphasized that users should consider the predictions on human chromosome 13 as hypothetical until experimentally proven. The data shown in Tables 4. 7 and 4.8 suggest that more than half of the manually annotated exons are missed by the EGPred method, whereas it predicts a large number of unannotated or possibly novel exons. An initial survey of these 'new' exons suggests a high false positive rate, integral mainly to the ab initio methods-genscan and HMMgene-used in EGPred. However, removal of the number of genes proportional to fraction of wrong exons (WE) reported for these individual ab initio methods still results in a high number of unaccounted exons. Because the predictions in EGPred from ab initio methods are filtered based on high confidence scores, it is more than likely that these novel exons are protein coding. One of the reasons that EGPred detects only half of manually annotated exons may lie in the use of only protein sequence data for the present combination-based predictions, whereas manual annotation use evidence from multiple sources including ESTs and edna apart from protein sequences. This would 79

33 qene Prediction suggest the need for integrating evidences from these additional sources and other gene prediction programs as well. The incorporation of recently published methods such as Combiner program (Allen et al. 2004) that integrates multiple gene prediction programs and evidences from edna, ESTs, proteins and splice site predictors may further improve to EGPred predictions. Combiner makes use of three different strategies for effective combination of predictions from more than one gene prediction program. Two of the Combiner strategies are Linear methods (LCl and LC2) that use an equal or unequal voting function with use of subsequent Dynamic Programming (DP) algorithm to construct gene models from different inputs. The third method, a decision tree-based nonlinear statistical combiner (SC) model, uses confidence scores to for combination of predictions from more than 2 gene finders. Incorporating one or more of these methods into EGPred algorithm would provide a strategy for combining predictions from more than one gene finder. 4.4 LIMITATIONS The information flow during transcription and translations events in a cell is not static or determined by a simple, single method as the predictions from gene finding programs suggests. There are several variants to how the information is passed from DNA to RNA to Protein. Some of the possible events include alternative splicing, use of nonconsensus splice sites, exon skipping (where a possible second transcript from a gene does not include one or more exons that are included in the first transcript from the same gene), and nested genes (a gene that is present inside intron of another protein-coding gene). Similarity evidence from protein database could possibly provide valuable information regarding alternative splice sites and use of non-consensus splice sites in a gene. One logical method to solve these problems would be to keep the complete 'global transcript' information of all protein hits from a database similarity search. From such information, it is possible to derive overlapping hits to probable exon regions. The differences in the length and signal content of overlapping exon regions from different transcripts would easily provide information regarding complex events such as use of alternate translation start site (TSS), and alternate splicing sites. Use of non-consensus splice sites is also easily identifiable with the similarity-based approach against protein database. Similarly, nested genes could possibly be identified using a negative evidence (intron) approach. Any such gene that will be completely masked by similarity to intron 80

34 qene Precfiction sequences can be considered as a probable nested gene. Since protein-coding regions are usually conserved, all exons from such a smaller nested gene should show similarity to a single intron sequence (if available) in a database. Similar strategy can also be used to identify events like exon skipping. Multiple transcripts obtained from similarity searches against protein sequences could provide information on whether a particular exon is not being included in one or more transcripts. Alternatively, complete masking of a probable exon (with strong supporting evidences) from a protein similarity-based transcript by an intron sequence would confirm the probability of the occurrence of an exon-skipping event. We are presently continuing work in many of topics mentioned here to further improve the prediction through logical use of multiple transcript information from the similarity searches against protein and intron sequences and also to improve the information content of the output in this respect. 4.5 DEVELOPMENT OF A WEB BASED SERVER, EGPred A web server, EGPred, was developed to predict the genes in eukaryotic genomic DNA using approach described in this study. The Web-based server is available from The Fig 4.2 shows the home page of EGPred Web-server. The server allows users to paste or upload a nucleotide sequence in FASTA format. Although, by default server uses the parameters that show best performance as in the present study, it also allows the users to change the various parameters. The parameters that users are allowed to change include the organism type for Genscan program, the organism type and probability score cut-off of predicted exons for HMMgene program, the E-value cut-off, protein database that is searched against, matrices and word length used during BLASTX searches. It is possible that large sequences may take substantial time for gene prediction. For this users are able to obtain results via . EGPred presents the results in graphic format as a GIF image where it shows the genes predicted by different strategies along the length of query sequence. Figure 4.3 shows graphical output from EGPred for a mouse sequence (GI:X07625; Locus ID:MMPROTl) on default parameters. Exons predicted by different method are represented by different colors. In addition to the graphic format the server present results in standard gene-finding format (GFF). The annotation shows the start of exon, stop of 81

35 qene CJlreatction exon, and type of exon predicted. The exon type can be any of the following-initial exon, internal exon, terminal exon and single exon. The exons are modeled into gene structures based on the order of their physical occurrence along the sequence. The possible gene structures are single exon genes, multi-exon genes and partial genes. Partial genes do not have complete gene structure. Biolnformatlcs Centre lnstthrta of Microbial Technology Chandlgarll, INDIA EGPH.ED SEHVER.. - AEl.vE AESaLTS SI4ct ylnfemwtion u JUM " COIITACT L\lfH' I! JOOLS GWFASTA IU:cent trend in gme prediction is to combine similarity information wilh ab i1rilio predictions (Ashurst and Colbns 2003). Presence of similar sequences in dalabses increases!he probability of query sequences being correctly predicted. Combinalion of ab initio gme predictions wilh similarity information has prmously ~d in improvement of one aspect of prediction accuracy. The increase in coverage or sensilivily is more or less of!set by 1he redudion in specificity. EGPred implea=ts a new ~gy where the spec:ificily is maintained while 1he sensilivily is increased on combining ab initio predictions wilh BI.l\STX aons that are fillered using criteria such as leng!h of HSP alignment, percent idenlily, score, expectation value, and similarity to inlron sequences. This ~gy increases!he sensilivily &om 0.95 to O.'J7 for Genscan. &om 0.93 to 0.96 for HMMgene, &om 0.94 to 0.96 for EUimethod, &om 0.93 to 0.96 for EUI-Frame method, and&om 0.91 to 0.95 for GI method on HMR195 dataset using default values. The spec:ificily at!he uucleotide level remain unchanged due to the combinalion procedure, ~hereby fulfilling a major requirement of researcl>ors for a more accurate predictions &om combinalion methods. A comprehensive and detailed evaluation of!he EGPred server is available elsewhere. Send Comments and Suggestions to: raghava@imtech.res.in If you use EGPred, please site: Figure 4.2: Home page ofegpred server 82