A Novel Splice Site Prediction Method using Support Vector Machine

Journal of Computational Information Systems 9: 20 (2013)
Available at

Dan WEI 1,2, Huiling ZHANG 2, Yanjie WEI 2, Qingshan JIANG 2

1 Cognitive Science Department & Fujian Key Laboratory of the Brain-like Intelligent Systems, Xiamen University, Xiamen , China
2 Shenzhen Key Lab for High Performance Data Mining, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen , China

Abstract

We present a novel classification method for splice site prediction using a support vector machine (SVM). The method first represents input sequences by sequence-based features: the distribution of tri-nucleotides and the conserved features surrounding the splice sites, characterized by a Markov model. An F-score based feature selection method is then used to select informative features and improve performance. Finally, an SVM classifies the splice sites using the selected features. Experimental results show that this method improves splice site prediction accuracy and performs better than existing methods such as MM1-SVM, Reduced MM1-SVM and some other methods.

Keywords: Splice Site; Support Vector Machine; Distribution of Tri-nucleotides; Markov Model

1 Introduction

With the rapid growth of DNA sequence data, identifying genes has become an important task in bioinformatics [1]. In eukaryotic genomes, genes are usually not continuous but consist of a set of exons, which are separated by intervening non-coding introns [2]. The boundaries between exons and introns are referred to as splice sites, which comprise donor and acceptor splice sites. To detect eukaryotic genes, it is important to predict splice sites accurately. DNA is a sequence of nucleotides represented by a four-letter alphabet.
The acceptor splice site is the intron-exon boundary, marked by the consensus dinucleotide AG, and the donor splice site is the exon-intron boundary, where the intron begins with the dinucleotide GT [3]. Because the GT and AG dinucleotides occur very frequently at non-splice-site positions, it is very hard to distinguish a true donor/acceptor splice site from a false one. Predicting splice sites can be treated as two binary classification tasks, one for acceptor sites and the other for donor sites.

Project supported by the National Natural Science Foundation of China (No , No ), the Shenzhen New Industry Development Fund (No.CXB A), and the Science Technology and Innovation Committee of Shenzhen Municipality (No. JCYJ ).
Corresponding author. Email addresses: yj.wei@siat.ac.cn (Yanjie WEI), qs.jiang@siat.ac.cn (Qingshan JIANG).
Copyright 2013 Binary Information Press. DOI: /jcis6763. October 15, 2013

Several computational methods have been developed to predict splice sites [4, 5]. These methods primarily look for consensus motifs or patterns in the surroundings of the splice sites. Although consensus segments at the junctions between introns and exons can help predict splice sites, they are not sufficient to differentiate accurately between introns and exons [6]. Studies have indicated that tri-nucleotide preference [7] is useful in characterizing splice sites [8], and tri-nucleotide repeats have been shown to be closely related to splice regulation [9].

This paper proposes a new computational approach to predict splice sites with high confidence. By combining the distribution of tri-nucleotides with the information of conserved segments, represented by the first-order Markov model (MM1), we transform the candidate sequences into numerical feature vectors. After a feature selection step that retains the informative features, an SVM is employed to predict the splice sites. Experimental results show that our method improves splice site prediction performance compared to existing methods such as MM1-SVM, Reduced MM1-SVM and some other methods.

2 Related Work

Different methods have been developed to predict splice sites accurately, including Bayesian networks [1], probabilistic approaches [10-15], artificial neural networks (ANN) [16] and support vector machines (SVM) [2, 3, 17-19]. The probabilistic approach and the SVM method are the most popular. The probabilistic approach computes the likelihood of the GT/AG dinucleotides to estimate the position specific distributions of splice sites. The weight matrix model (WMM) [10] and the weight array model (WAM) [11] are earlier models in this category.
The maximal dependence decomposition (MDD) model in Genscan [12] is a more complex model, and the maximum entropy model (MEM) [14] was built later to predict splice sites using a maximum entropy approach. MEM outperformed MDD on donor site prediction. More recently, a length-variable Markov model (LVMM) [15] was proposed; it uses the second-order Markov model (MM2) and is computationally expensive. Although MEM and LVMM achieve good performance, both require a threshold parameter to distinguish the splice sites, and this threshold is difficult to determine.

SVM [20, 21] classifies the splice sites by converting candidate sequences into feature vectors, and it performs well in splice site prediction [3]. Zhang et al. [17] predicted splice sites using SVM with a Bayes kernel (SVM-B). More recently, probabilistic models have been used to capture the features of splice site sequences and serve as preprocessing steps for an SVM that classifies the splice sites [18]. The performance of MM1-SVM appeared to be better than that of other methods, and its results are promising when compared with MDD [12] and GeneSplicer [13]. To focus only on the more important features, Reduced MM1-SVM [19] was developed, which uses a subset of the MM1 parameters as input for the SVM.

MM1 relies mainly on the nucleotide dependencies between adjacent bases. Exon regions normally exhibit period-3 behavior, and particular tri-nucleotides occur upstream of donor sites and downstream of acceptor sites. Exploring the properties of these gene segments should therefore help characterize the splice sites.

3 Method

In this section, we extract the distribution of tri-nucleotides and combine these features with the probabilistic parameters of MM1 to detect splice sites. We also use an F-score based method to select the most informative features.

3.1 Feature extraction

We compute the distribution of tri-nucleotides to characterize the exon and intron regions, and use the probabilistic parameters of MM1 to describe the sequential relationships between nucleotides.

3.1.1 Distribution of tri-nucleotides

The gaps between the locations at which a tri-nucleotide occurs in the sequence are used to explore the sequence structure [22]:

$$\alpha_r = \frac{1}{pos_r - pos_{r-1}}, \quad 1 \le r \le m \tag{1}$$

where $pos_r$ is the location of the $r$th occurrence of a tri-nucleotide $t$, and $pos_0 = 0$. $\alpha_r$ indicates the relative position of two neighboring occurrences of the tri-nucleotide in the sequence, so $\{\alpha_1, \alpha_2, \ldots, \alpha_m\}$ allows us to find all subsequent repeats of $t$. $\beta_j$ is defined as a partial sum of the $\alpha_r$:

$$\beta_j = \sum_{r=1}^{j} \alpha_r, \quad 1 \le j \le m \tag{2}$$

$\{\beta_j\}$ is a nondecreasing ordered set. As a discrete set, $\{\beta_j\}$ can be considered as a point partition, from which one can construct a discrete probability distribution $Q = (q_1, q_2, \ldots, q_m)$, where $q_i = \beta_i / \sum_{k=1}^{m} \beta_k$ and $\sum_{i=1}^{m} q_i = 1$. The Pseudo-Entropy (PE) of this discrete probability distribution is calculated by

$$PE(q_1, q_2, \ldots, q_m) = \sum_{i=1}^{m} q_i e^{1 - q_i} \tag{3}$$

According to Ref. [23], the pseudo-entropy includes information on the position in a sequence, and the feature vector based on PE can be regarded as the distribution of tri-nucleotides for its corresponding sequence. For a DNA sequence, there are 64 distinct tri-nucleotides to be considered. The resulting 64-dimensional feature vector is denoted by $(pe_1, pe_2, \ldots, pe_{64})$, where $pe_i$ is the feature representation of the $i$th tri-nucleotide.
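As an illustration of Eqs. (1)-(3), the following minimal Python sketch computes the pseudo-entropy of one tri-nucleotide and the 64-dimensional vector for one segment. It is not the authors' implementation. The function names are hypothetical, and three choices are assumptions that the text does not specify: positions are 1-based, overlapping occurrences are counted, and a tri-nucleotide that never occurs receives the value 0.

```python
import math
from itertools import product

# All 64 tri-nucleotides in a fixed (alphabetical) order; the order is an assumption.
TRINUCLEOTIDES = ["".join(p) for p in product("ACGT", repeat=3)]


def pseudo_entropy(seq, tri):
    """Pseudo-entropy of tri-nucleotide `tri` in `seq`, following Eqs. (1)-(3)."""
    # 1-based start positions of every (possibly overlapping) occurrence.
    positions = [i + 1 for i in range(len(seq) - 2) if seq[i:i + 3] == tri]
    if not positions:
        return 0.0  # assumption: absent tri-nucleotides get a zero feature

    # Eq. (1): alpha_r = 1 / (pos_r - pos_{r-1}), with pos_0 = 0.
    alphas = []
    previous = 0
    for pos in positions:
        alphas.append(1.0 / (pos - previous))
        previous = pos

    # Eq. (2): beta_j is the partial sum of alpha_1 .. alpha_j.
    betas = []
    running = 0.0
    for alpha in alphas:
        running += alpha
        betas.append(running)

    # Normalise the betas into the discrete distribution Q.
    total = sum(betas)
    qs = [beta / total for beta in betas]

    # Eq. (3): PE = sum_i q_i * exp(1 - q_i).
    return sum(q * math.exp(1.0 - q) for q in qs)


def pe_features(seq):
    """64-dimensional PE vector (pe_1, ..., pe_64) for one segment."""
    return [pseudo_entropy(seq, tri) for tri in TRINUCLEOTIDES]
```

For the segment `"ACGACG"` and the tri-nucleotide `ACG`, the occurrences are at positions 1 and 4, giving $\alpha = (1, 1/3)$, $\beta = (1, 4/3)$ and $Q = (3/7, 4/7)$. Applying `pe_features` to the upstream and downstream segments separately and concatenating the two results would give a 128-dimensional vector.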
Combining the features from the upstream and the downstream segments, we obtain a distribution of tri-nucleotides represented by 128 features, $(pe_1, pe_2, \ldots, pe_{128})$.

3.1.2 MM1

A Markov model (MM) computes the probability of a sequence element based on the preceding elements [24] and can describe the sequential relationships between nucleotides with position specific probabilistic parameters. The set of parameters in MM1 is $\{P(s_i \mid s_{i-1})\}$, where $i = 1, 2, \ldots, l$. For a sequence of length $l$, there are $l-1$ position specific probabilistic parameters. The model parameters are estimated by

$$P(s_i \mid s_{i-1}) = \frac{N(s_{i-k}, \ldots, s_i)}{N(s_{i-k}, \ldots, s_{i-1})} \tag{4}$$

where $N(s_{i-k}, \ldots, s_i)$ is the number of occurrences of $s_{i-k}, \ldots, s_i$ in the training data set, and $k$ is the order of the Markov chain; in this paper $k = 1$.

The training sequences are used to train two different models: a model for true sites, $M_T$, and a model for false sites, $M_F$. Once the models are constructed, each sequence in the training and testing data sets is modeled by both $M_T$ and $M_F$, and the resulting conditional probabilities are used as inputs to the classifier. For each sequence, there are $2(l-1)$ conditional probabilities.

3.2 Feature selection for SVM method

A support vector machine [20] maps data into a high dimensional space. Because the radial basis function (RBF) kernel gives good performance and is a popular choice when using SVM to predict splice sites, it is used in our experiments.

Feature selection is very important for the SVM method, because irrelevant or redundant features may decrease the prediction accuracy of the classification model. Using the average values of the true and false instances, the F-score [25] measures the discriminative power of each feature and assigns a numeric score to it. We use the F-score method to select the informative features and remove low-scoring ones.

We divide splice site prediction into two classification problems: {donor site, non-donor site} and {acceptor site, non-acceptor site}. The classification algorithm, SVM with the distribution of tri-nucleotides and Markov model (DM-SVM), is outlined in Algorithm 1.

Algorithm 1. DM-SVM
Input: The candidate splice site sequences, $\{S_1, S_2, \ldots, S_N\}$; length of sequence, $l$
Output: Labels of unknown sequences
Steps:
1. For $n = 1$ to $N$ do
   Compute the vector $(pe_1, pe_2, \ldots, pe_{64}, pe_{65}, \ldots, pe_{128})$ according to Eqs. (1)-(3) for the upstream and downstream segments surrounding the splice site;
   Model $S_n$ with $M_T$ and with $M_F$ according to Eq. (4), so that $S_n$ is represented by a vector of $2(l-1)$ conditional probabilities, $(p_1, p_2, \ldots, p_{l-1}, p_l, \ldots, p_{2(l-1)})$;
   End for
2. Calculate the F-score of each feature in $(pe_1, pe_2, \ldots, pe_{64}, pe_{65}, \ldots, pe_{128})$ on the training set, and take the average of all F-scores as the threshold $pe_\tau$;
3. For each sequence, select the features from $(pe_1, pe_2, \ldots, pe_{64}, pe_{65}, \ldots, pe_{128})$ whose F-score is greater than $pe_\tau$, and construct the reduced vector $(pe'_1, pe'_2, \ldots)$;
4. As in step 2, calculate the F-score of each feature in $(p_1, p_2, \ldots, p_{2(l-1)})$ and take the average of all F-scores as the threshold $p_\tau$;
5. As in step 3, construct the reduced vector $(p'_1, p'_2, \ldots)$;
6. For each sequence, merge $(pe'_1, pe'_2, \ldots)$ with $(p'_1, p'_2, \ldots)$;
7. Train the SVM on the training sequences represented by these sequence-based vectors to obtain the prediction model, and use the model to predict the splice sites of the testing sequences.
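Steps 2 and 3 of Algorithm 1 (and, applied to the MM1 probabilities, steps 4 and 5) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' code: the F-score is taken to be the one defined by Chen and Lin [25], class labels are assumed to be encoded as 1 (true site) and 0 (false site), each class is assumed to contain at least two training sequences, and the function names `f_scores` and `select_above_mean` are hypothetical.

```python
def f_scores(X, y):
    """F-score of each feature column (definition of Chen and Lin [25]).

    X is a list of feature rows; y holds 1 for true sites and 0 for false sites.
    """
    pos = [row for row, label in zip(X, y) if label == 1]
    neg = [row for row, label in zip(X, y) if label == 0]
    scores = []
    for j in range(len(X[0])):
        col_all = [row[j] for row in X]
        col_pos = [row[j] for row in pos]
        col_neg = [row[j] for row in neg]
        mean_all = sum(col_all) / len(col_all)
        mean_pos = sum(col_pos) / len(col_pos)
        mean_neg = sum(col_neg) / len(col_neg)
        # Numerator: separation of the two class means from the overall mean.
        between = (mean_pos - mean_all) ** 2 + (mean_neg - mean_all) ** 2
        # Denominator: sum of the two within-class sample variances.
        var_pos = sum((v - mean_pos) ** 2 for v in col_pos) / (len(col_pos) - 1)
        var_neg = sum((v - mean_neg) ** 2 for v in col_neg) / (len(col_neg) - 1)
        within = var_pos + var_neg
        # Assumption: a constant feature (zero variance) scores 0.
        scores.append(between / within if within > 0 else 0.0)
    return scores


def select_above_mean(X, y):
    """Keep the features whose F-score exceeds the average F-score."""
    scores = f_scores(X, y)
    threshold = sum(scores) / len(scores)  # pe_tau or p_tau in Algorithm 1
    keep = [j for j, s in enumerate(scores) if s > threshold]
    reduced = [[row[j] for j in keep] for row in X]
    return keep, reduced
```

The indices in `keep` would be computed on the training set only and then reused to reduce the test vectors. The two reduced vectors can then be concatenated per sequence and passed to any RBF-kernel SVM implementation; the text does not name the SVM software that was used.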

4 Results and Discussion

4.1 Data set

The experiments were performed on the Homo Sapiens Splice Sites Data set (HS3D) [26] which can be downloaded from and the length of each sequence is 140 nucleotides. The data set contains 2796 true donor sites, 2880 true acceptor sites, false donor sites and false acceptor sites. First, we construct the 1:1 data set by choosing all the true splice sites and randomly selecting an equal number of false sites. We also construct the 1:10 data set, which contains all the true splice sites, randomly selected false donor sites, and randomly selected false acceptor sites.

4.2 Evaluation

To measure the quality of the prediction results, we use the sensitivity $S_n$, the specificity $S_p$, and a global accuracy $Q_9$ [27] to evaluate the classification performance. $Q_9$ is independent of the class distribution in the data set and is defined as follows:

$$Q_9 = (1 + q_9)/2 \tag{5}$$

where

$$q_9 = \begin{cases} (TN - FP)/(TN + FP), & \text{if } TP + FN = 0 \\ (TP - FN)/(TP + FN), & \text{if } TN + FP = 0 \\ 1 - \sqrt{2}\,\sqrt{[FN/(TP + FN)]^2 + [FP/(TN + FP)]^2}, & \text{if } TP + FN \ne 0 \text{ and } TN + FP \ne 0 \end{cases}$$

and $TP$, $FN$, $TN$ and $FP$ are the numbers of true positives, false negatives, true negatives and false positives, respectively. The larger $Q_9$ is, the better the classification result. For every experiment, ten-fold cross-validation is used to estimate the effectiveness of the classification.

4.3 Experimental results

To test the efficiency of DM-SVM in splice site prediction, we ran it on both the 1:1 and 1:10 data sets. The classification results are compared with those of MM1-SVM, Reduced MM1-SVM, and MEM. For the SVM methods, we rely on grid search to find the optimal parameters (the soft margin parameter $C$ and the kernel parameter $\gamma$). For MEM, the results are obtained by varying the threshold, the score used to distinguish the true sites from the false ones, between 0 and 10 with a step size of 0.5.
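The accuracy index of Section 4.2 can be computed directly from the four confusion-matrix counts. The sketch below is illustrative only; it assumes the third case of $q_9$ has the form $1 - \sqrt{2}\sqrt{[FN/(TP+FN)]^2 + [FP/(TN+FP)]^2}$, as in the usual definition of $Q_9$ from Ref. [27], and the function name `q9_accuracy` is hypothetical.

```python
import math


def q9_accuracy(tp, fn, tn, fp):
    """Global accuracy Q9 = (1 + q9) / 2 from confusion-matrix counts."""
    if tp + fn == 0 and tn + fp == 0:
        raise ValueError("at least one sample is required")
    if tp + fn == 0:
        # No positive samples: only the negatives contribute.
        q9 = (tn - fp) / (tn + fp)
    elif tn + fp == 0:
        # No negative samples: only the positives contribute.
        q9 = (tp - fn) / (tp + fn)
    else:
        miss_rate = fn / (tp + fn)         # fraction of true sites missed
        false_alarm_rate = fp / (tn + fp)  # fraction of false sites accepted
        # Assumed form of the third case (see the note above the block).
        q9 = 1.0 - math.sqrt(2.0) * math.sqrt(
            miss_rate ** 2 + false_alarm_rate ** 2
        )
    return (1.0 + q9) / 2.0
```

Because only the two error rates enter the third case, the value does not change when one class is scaled up: `q9_accuracy(5, 5, 50, 50)` and `q9_accuracy(50, 50, 5, 5)` both describe a classifier that is right half the time on each class and both evaluate to approximately 0.5. This is the sense in which $Q_9$ is independent of the class distribution, which matters for the 1:10 data set.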
The results shown in Table 1 for the MEM method are the average scores from ten-fold cross-validation. In addition, we have compared DM-SVM with SVM-B and LVMM; the results of the SVM-B and LVMM methods are taken from Ref. [15].

The test results of the different methods on the 1:1 and 1:10 data sets are shown in Table 1 and Table 2. In Table 1, DM-SVM outperforms the other methods for both acceptor and donor splice site prediction. For acceptor sites, the $Q_9$ score of DM-SVM is clearly better than those of the other four methods. MEM and SVM-B perform very similarly and better than Reduced MM1-SVM, followed by MM1-SVM in terms of $Q_9$ value. Moreover, DM-SVM has the best prediction performance in terms of sensitivity and specificity. The relative rankings of the different methods according to sensitivity and specificity are not the same as the ranking by $Q_9$ value, because $Q_9$ is a global accuracy measure calculated from both the sensitivity and the specificity scores.

Table 1: Performance of different methods for acceptor and donor site prediction on the 1:1 data set
Columns: Method; Acceptor ($S_n$, $S_p$, $Q_9$); Donor ($S_n$, $S_p$, $Q_9$)
Rows: MM1-SVM, Reduced MM1-SVM, SVM-B, MEM, DM-SVM

For donor sites, DM-SVM produces the best result with a $Q_9$ value of . The relative ranking of the other methods according to the $Q_9$ value is MEM, SVM-B, Reduced MM1-SVM and MM1-SVM. This ranking is similar to that of the five methods for acceptor site prediction. The $Q_9$ value of each method for donor sites is higher than that of the corresponding method for acceptor sites. This agrees with the results of Zhang et al. [15] and can be explained by the fact that the regions around donor sites are more conserved than the regions around acceptor sites and are therefore easier to predict. In addition, DM-SVM also performs better than the other methods in terms of sensitivity and specificity. As with the $Q_9$ value, the sensitivity and specificity of every method for donor sites are higher than those of the corresponding method for acceptor sites.

There are some differences between our results for MM1-SVM in Table 1 and the results reported in Ref. [15]. Our numbers in Table 1 are slightly lower than those in Ref. [15]. This is due to several factors, such as the different length of the sequences used for extracting the information, the different parameter values used in the SVM, and the different false splice sites used in the MM1-SVM method. Since the Reduced MM1-SVM method eliminates some less important features, its results are higher than those of MM1-SVM for both acceptor and donor sites.

Given that there are more false splice sites than true sites in real genome sequences, we also test our method on the 1:10 data set.
We have performed MM1-SVM, Reduced MM1-SVM and MEM prediction on the 1:10 data set, and the results are displayed in Table 2. In addition, the results of SVM-B and LVMM taken from Ref. [15] are included in Table 2 for comparison.

Table 2: Performance of different methods for acceptor and donor site prediction on the 1:10 data set
Columns: Method; Acceptor ($S_n$, $S_p$, $Q_9$); Donor ($S_n$, $S_p$, $Q_9$)
Rows: MM1-SVM, Reduced MM1-SVM, SVM-B, MEM, LVMM, DM-SVM

For acceptor sites, we again observe that the best prediction performance is produced by DM-SVM in terms of $Q_9$ value, sensitivity and specificity. The relative ranking of the other methods according to $Q_9$ value is LVMM2, MEM, SVM-B, Reduced MM1-SVM and MM1-SVM. This ranking is consistent with the ranking on the 1:1 data set in Table 1. The performance of DM-SVM, MEM and Reduced MM1-SVM on the 1:10 data set is similar to that of the same method on the 1:1 data set in terms of $Q_9$ value. For instance, the $Q_9$ values of DM-SVM are and on the 1:1 and 1:10 data set, respectively.

For donor sites, DM-SVM achieves the best prediction performance in terms of $Q_9$, sensitivity and specificity. According to the $Q_9$ value, MM1-SVM, Reduced MM1-SVM and SVM-B have comparable prediction performance, and MEM performs slightly worse than LVMM2 but better than these three methods. This relative ranking is similar to that on the 1:1 data set shown in Table 1. From Table 2, we again observe that the $Q_9$ value of every method for donor sites is higher than that of the corresponding method for acceptor sites. The explanation given for this behavior in Table 1 also applies here.

Overall, DM-SVM exhibits better prediction performance on both the 1:1 and 1:10 data sets. DM-SVM is clearly better than MM1-SVM and Reduced MM1-SVM for predicting acceptor and donor sites, which demonstrates that adding the distribution of tri-nucleotides helps improve the performance of the MM1 based method. Although DM-SVM performs only slightly better than LVMM2 for donor sites, the difference is clear for acceptor site prediction. In addition, the LVMM2 method uses second-order Markov models and is computationally very expensive, because more training data is needed to estimate the Markovian parameters.
5 Conclusions

In this paper, we presented a new classification method, DM-SVM, which effectively predicts both donor and acceptor splice sites. To improve performance, DM-SVM maps candidate splice site sequences onto feature vectors built from the distribution of tri-nucleotides and the MM1 model parameters. DM-SVM then applies an F-score based method to choose the most discriminative features. Finally, an SVM performs the classification task using the reduced features. DM-SVM learns the features that differ between exons and introns as well as the conserved features surrounding the splice sites. The experimental results showed that DM-SVM achieves better performance than MM1-SVM, Reduced MM1-SVM and some other methods. In addition, the proposed method may be extended to identify other specific sites in a sequence.

References

[1] Chen TM, Lu CC, Li WH. Prediction of splice sites with dependency graphs and their expanded Bayesian networks. Bioinformatics 2005, 21(4):
[2] Degroeve S, Saeys Y, De Baets B, Rouzé P, Van de Peer Y. SpliceMachine: predicting splice sites from high-dimensional local context representations. Bioinformatics 2005, 21(8):
[3] Sonnenburg S, Schweikert G, Philips P, Behr J, Rätsch G. Accurate splice site prediction using support vector machines. BMC Bioinformatics 2007, 8 (Suppl 10): S7.
[4] Mathé C, Sagot MF, Schiex T, Rouzé P. Current methods of gene prediction, their strengths and weaknesses. Nucleic Acids Res 2002, 30:

[5] Brent MR, Guigó R. Recent advances in gene structure prediction. Curr Opin Struct Biol 2004, 14:
[6] Lim LP, Burge CB. A computational analysis of sequence features involved in recognition of short introns. Proc Natl Acad Sci USA 2001, 98(20):
[7] Staden R, McLachlan AD. Codon preference and its use in identifying protein coding regions in long DNA sequences. Nucleic Acids Res 1982, 10(1):
[8] Nikolaou C, Almirantis Y. Measuring the coding potential of genomic sequences through a combination of triplet occurrence patterns and RNY preference. J Mol Evol 2004, 59(3):
[9] Parmley JL, Hurst LD. Exonic splicing regulatory elements skew synonymous codon usage near intron-exon boundaries in mammals. Mol Biol Evol 2007, 24(8):
[10] Staden R. Computer methods to locate signals in nucleic acid sequences. Nucleic Acids Research 1984, 12:
[11] Zhang MQ, Marr TG. A weight array method for splicing signal analysis. Comput Appl Biosci 1993, 9(5):
[12] Burge C, Karlin S. Predictions of complete gene structures in human genomic DNA. J Mol Biol 1997, 268:
[13] Pertea M, Lin X, Salzberg SL. GeneSplicer: A new computational method for splice site prediction. Nucleic Acids Research 2001, 29(5):
[14] Yeo G, Burge C. Maximum entropy modeling of short sequence motifs with application to RNA splicing signals. J Comput Biol 2004, 11(2-3):
[15] Zhang Q, Peng Q, Zhang Q, Yan Y, Li K, Li J. Splice sites prediction of human genome using length-variable Markov model and feature selection. Expert Syst. Appl. 2010, 37:
[16] Rajapakse JC, Ho LS. Markov encoding for detecting signals in genomic sequences. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2005, 2(2):
[17] Zhang Y, Chu CH, Chen Y, Zha H, Ji X. Splice site prediction using support vector machines with a Bayes kernel. Expert Systems with Applications 2006, 30(1):
[18] Baten AKMA, Chang BCH, Halgamuge SK, Li J. Splice site identification using probabilistic parameters and SVM classification. BMC Bioinformatics 2006, 7 (Suppl. 5): S15.
[19] Baten AKMA, Halgamuge SK, Chang BCH. Fast splice site detection using information content and feature reduction. BMC Bioinformatics 2008, 9 (Suppl. 12): S8.
[20] Vapnik VN. Statistical learning theory. Wiley, Chichester, UK;
[21] Ye Q, Zhao C, Ye N. A New SVM Classification Approach via Minimum Within-class Variance. Journal of Computational Information Systems 2010, 6(1):
[22] Wei D, Jiang QS, Wei YJ, Wang SR. A novel hierarchical clustering algorithm for gene sequences. BMC Bioinformatics 2012, 13: 174.
[23] Li C, Ma H, Zhou Y, Wang X, Zheng X. Similarity Analysis of DNA Sequences Based on the Weighted Pseudo-Entropy. Journal of Computational Chemistry 2011, 32(4):
[24] Durbin R, Eddy S, Krogh A, Mitchison G. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press, London;
[25] Chen YW, Lin CJ. Combining SVMs with various feature selection strategies. Feature extraction: foundations and applications 2006, 207:
[26] Pollastro P, Rampone S. HS3D: Homo Sapiens Splice Sites Dataset. Nucleic Acids Research 2003 Annual Database Issue.
[27] Zhang CT, Zhang R. Evaluation of gene-finding algorithms by a content balancing accuracy index. Journal of Biomolecular Structure and Dynamics 2002, 19(6):