The Identification of Human Cassette Exons based on SVM

Size: px
Start display at page:

Download "The Identification of Human Cassette Exons based on SVM"

Transcription

1 I.J. Engneerng and Manufacturng 2011, 1, Publshed Onlne February 2011 n MECS ( DOI: /em Avalable onlne at The Identfcaton of Human Cassette Exons based on SVM Lan Tao a, Yanmeng Xu a,*, Huaku Chen a, Zexuan Zhu a a College of Computer and Software, Shenzhen Unversty, Shenzhen, P.R. Chna Abstract Alternatve splcng s the man mechansm expandng transcrpt dversty. Cassette exon s an mportant alternatve splcng form that s very smlar to consttutve exon n sequence features. Prevous studes whch based on expressed sequence tags (ESTs) and evolutonary conservaton nformaton have dentfed many alternatve splcng events. In ths paper, we construct a classfer to dentfy the human cassette exons based on Support vector machne (SVM) whch only make use of sequence nformaton. It can acheve the accuracy of 68.12%. Especally, the classfer can acheve 76.54% when consderng the splcng frequency. The results show that the senstvty and specfcty of ths method are hgher than those recently reported on the same dataset. Index Terms: Cassette exon;consttutve exon; Support vector machne (SVM);Sequence features 2011 Publshed by MECS Publsher. Selecton and/or peer revew under responsblty of the Research Assocaton of Modern Educaton and Computer Scence. 1. Introducton Alternatve splcng s an effcent mechansm for the generaton of transcrpt dversty. The genome-wde analyss of alternatve splcng ndcated that approxmately half of human genes have alternatve splce forms and 74-88% of alternatve splces change the proten product [1]. There have been several knds of software for gene predcton, such as GenScan, GeneMark, HMMgene and GeneID. But n these programs, alternatve splcng events are not consdered. Although sequencng of expressed sequence tags (ESTs) and mcroarray analyss are effectve expermental methods for dentfcaton of alternatve splcng events, the work also needs computatonal approaches, because of the expermental lmtatons. Another method of dentfyng alternatve splced events s based on the features of sequences and conservaton nformaton of the human and mouse genome [2-4]. But t also has defects because of some exons or splce stes are lackng of homology nformaton. * Correspondng author. E-mal address: xuyanmeng2005@yahoo.com.cn

2 58 The Identfcaton of Human Cassette Exons based on SVM The recent studes have classfed alternatve splcng nto seven dfferent types. Of all these dfferent splced forms, Cassette exon was consdered to be the most common one whch also called exon skppng [5]. A cassette exon s one form and s defned as one that s splced n some transcrpts but entrely not n others of the same gene. We amed at developng a method dentfyng cassette exons based on machne learnng. A gven human DNA sequence can be predcted that whether t s a cassette exon or not by ths method. The method we propose uses no ESTs or conservaton nformaton. It doesn t need any other nformaton than the sequence. Because of the resemblance between cassette exons and consttutve exons, t s stll a dffcult work to dentfcaton of cassette exons accurately wth ab nto method that we amed to. In the work of Yang and Sun [6-7], they have made some efforts. The best result can acheve the accuracy of 61% [6]. As the prevous studes showed that the selecton of features s very mportant n the dentfcaton of cassette exons. In order to mprove the predcton accuracy, the prevous works have proposed many useful characters [4]. Among all these features, exon length, 3-tuple counts, the nformaton from the splce stes, GC content of ntronc flanks and the ntensty of poly-pyrmdne tract (PPT) can be used n an ab nto method and found to be very mportant n dentfyng cassette exons. Besdes that, the Heterogenety Index (HI) of nuclec acd sequences s also found to be very remarkable between the two types of exons [8]. 2. Materal and Methods 2.1. Dataset The sequence of human cassette and consttutve exons are extracted from the AltSplce (human release 3) database [9]. All of these exons are n GT-AG form. In ths dataset, consttutve exons are supported by at least 20 ESTs. Especally, all the cassette exons used n ths paper are smple cassette exon (SCE). For each canddate exon n the dataset, we selected 3 parts of sequence: (1) the last 80nt of the upstream ntron flankng the 3 splce ste, (2) the whole exonc sequence, (3) the frst 80nt of the downstream ntron flankng the 5 splce ste. The ultmate dataset contans 3186 cassette and consttutve exons whch also called dataset D1. Because of the unbalance of the two types of exon data, 3125 consttutve exons are selected randomly from D1. At last, the tranng and testng datasets were dvded randomly n the proporton of 4:1. The testng datasets s ndependent completely. Table 1 shows the dstrbuton of exons n detal n the experment. Table 1 The dstrbuton of exons 2.2. SVM Exon Class Tranng Set Testng Set Cassette exon Consttutve exon As a wdely used machne learnng method, Support Vector Machnes was performed n the dentfcaton of splce stes and alternatve splcng events wth good performances [3, 7]. The method s based on the prncple of rsk mnmzaton and the geometrc margn maxmzaton. The features are mapped to a hghdmenson space by a kernel functon. Then a hyper plane n the new space s traned from the tranng data to perform the classfcaton. In ths paper, the wdely adaptable radal bass functon (RBF) kernel was selected to tran the model n classfcaton:

3 Rate The Identfcaton of Human Cassette Exons based on SVM 59 K( x x ) exp( x x ), 0 Of the RBF kernel, two parameters (C, γ) are very mportant. We appled LIBSVM [10] to buld the classfer and performed a grd search to choose the best parameters. The LIBSVM s an ntegrated software pack for support vector classfcaton, regresson and dstrbuton estmaton. In addton, to ensure the classfer s not over ftted to the tranng data, we performed 5-fold cross-valdaton to obtan the best parameters for tranng process. 3. Sequence Feature Analyss The usage of dfferent feature could affect the predcton accuracy sgnfcantly [3]. To get a hgher accuracy, t s necessary to mne more nformatve features and combne them effectvely. Based on the prevous work, the followng sequence features would be tred to mprove the predcton accuracy through sequence analyss. All of the analyses are based on D Exon Length Through analyss, more than 97% of exon lengths are wthn 300bp. In ths work, the nterval [0,300] s dvded nto 10 equal subntervals. We calculate the number of exon for each subnterval and translate the number nto rato. The last result s showed n Fg cassette exon consttutve exon Length of exon Fg. 1 The dstrbuton of exon length Accordng to the Fg. 1, the proportons of short exon (<120bp) n cassette exons are more than the consttutve exons obvously. At the same tme, the proportons of the other exon whch length s more than 120bp are on the low sde. In addton, f the length of a cassette exon s multple of 3, all the codons n the cassette exon events wll keep unchanged except the two neghbor codons of the cassette exon s upstream and downstream. The exon called frame-preservaton exon. Otherwse, the exon called frame-swtchng exon. Actually, the proportons of the frame-preservaton exon n the two types of exons are 41.12% and 38.04%.

4 60 The Identfcaton of Human Cassette Exons based on SVM 3.2. K-tuple Counts In the work of [1], the 3-mer composton s ncrement of dversty (ID) was used to predcton alternatve splce stes and performed effectvely. But t s found to be that the 5-mer composton s ID features are more effcent for the dentfcaton of cassette exons n our work. The concluson s consstent wth the work of [11]. It can acheve the accuracy of 63% when only make use of the 5-mer s ID features through experment. For two sources of dversty: X :{ n1, n2,, n d } andy :{ m1, m2,, m d }, the ID s defned as: ID( X, Y) D( X Y) D( X ) D( Y) The more smlar two sources are,the smaller ID s. More detals about the contents of ID could be seen n [1]. Gven a DNA sequence, we can get the 5-mer compostons through a wndow whose length s 5. And the length of slde wndow s 1. The value of d s So we can get three dversty sources for each exon sequence. They are come from the upstream, the exon and the downstream sequence. The same s that the postve and negatve tranng sample can also generate three dversty sources separately. Therefore, for each exon sequence, we can get a sx-dmensonal feature vector Splce Stes Fg. 2 llustrates the algnments of the donor and acceptor stes between cassette and consttutve exons. It s made by the WebLogo tool [12]. Fg. 2 The WebLogo graph of splce stes

5 Proporton The Identfcaton of Human Cassette Exons based on SVM 61 As can be seen from the fgure, both the donor stes and the acceptor stes have hgh smlar performance. The donor stes are conservatve at the regon [-3, +6] and the acceptor stes at the regon [-24, +2]. Although the dfference s very small, we can stll fnd that the conservatve of splce stes n consttutve exons s slghtly hgher than n cassette exons. Experments shows that add the splce stes features can mprove the result of the predcton. Poston weght matrx (PWM) s tradtonally used to represent the sequence patterns stored n the local mult-algnment of functonally related molecular sequences [13]. Gven a PWM, for a sequence wth specfc length, the PWM scorng functon (SF) could be calculated. If the SF value s larger, the sequence s possble to be a pattern that the PWM represent. More detals about the contents of PWM and SF could be seen n [1]. In the tranng process, the postve and negatve tranng sample can generate two PWMs separately. For each canddate exon sequence, four SF values could be calculated wth the four PWMs accordng to the correspondng splce stes. So we get a four-dmensonal feature vector about the splce stes Heterogenety Index (HI) It s well known that base sequences n the proten-codng regons of DNA molecules exhbt a perod-3 behavor, whch s the bass of gene predcton. In the work of [8], perod-3 behavor of the cassette exons was studed by HI. The defnton of HI can be represented as: HI 3 4 N N N N NN 1 2 N Here N (=1, 2, 3, 4) represent the number of four nucleotdes: A, C, G and T respectvely. N N ndcates the length of the sequence. N (=1, 2, 3) ndcates the length of the three sub-sequences under three dfferent readng frames. And N s the number of nucleotde n the sub-sequence. For each exon, the value of HI s calculated by (3). It could be found that more than 97% of the HI values are wthn the nterval [0, 40]. We dvde the nterval nto 20 equal subntervals and calculate the proporton of exon number wthn the correspondng subnterval. The result s showed n Fg. 3., N cassette exon consttutve exon HI Fg. 3 The dstrbuton of HI values

6 62 The Identfcaton of Human Cassette Exons based on SVM It could be seen from the Fg. 3, the HI values of consttutve exons are much more stable than the cassette exons. In fact, the mean HI values of consttutve and cassette exons are and respectvely. So t can be concluded that the HI value s an effectve feature to mprove the predcton accuracy The Other Features As mentoned n the ntroducton, GC content of ntronc flanks and the ntensty of PPT can be used n an ab nto method effectvely to dentfcaton of cassette exons. In ths work, GC content n the upstream ntron and n the downstream denote the number of Gs and Cs n the -70 to -1nt wndow at 3 ss and the +1 to +40nt wndow at the 5 ss. So we can get a two-dmensonal feature vector. Besdes that, the ntensty of PPT s calculated as the number of Cs and Ts n a sldng wndow of 19nt of the upstream of ntron. The wndow sldes n the area of the acceptor stes upstream -40 to -5nt. The maxmum number of Cs and Ts s the ntensty of PPT we used n the experment. 4. Results Fnally, the features we chose for the method are as follows: Table 2 Features for the method Dmenson 1 The Exon s length Features 2 The exon s length s multple of 3 or not (1 or 0) 3-8 The ID values of 5-mer s composton 9-12 The SF values of splce stes 13 The HI value of exon GC content of ntronc flanks 16 The ntensty of PPT In order to evaluate the predctve capablty and relablty of the method, the senstvty (Sn), specfcty (Sp) and total accuracy (TA) are defned as: Sn TP / ( TP FN), Sp TN / ( TN FP) TA ( TP TN) / ( TP TN FN FP) Where TP denotes the number of the correctly recognzed postve samples, FN denotes the number of postve samples recognzed as negatve, FP denotes the number of the negatve samples recognzed as postve, and TN denotes the number of correctly recognzed negatve samples. By grd-searchng, the parameters we chose to tran the SVM are: = ,C=8192. Then the SVM was traned on the tranng data and one model could be got. Based on the model, the method s evaluated on the test datasets. The result s shown n table 3. We compare the predcton of cassette exons n our method wth [6] and [7].

7 The Identfcaton of Human Cassette Exons based on SVM 63 Table 3 The performance of predcton Method Sn Sp TA Length a Our Method [6] [7] As shown n table 3, our method can get the best senstvty and specfcty. It can acheve the best accuracy of 68.12% and the dmenson of feature vector s not long. It s only based on the nformaton of sequence, so t can be used wdely and easly. 5. Dscusson Although we mproved the predcton accuracy of cassette exons wth a new feature vector. The work of predcton s stll on the low level. In ths research, we try to make some efforts to analyss the cassette exons wth dfferent splcng frequency [14, 15]. An EST s a short sub-sequence of a transcrbed cdna sequence. In the database of AltSplce, each exon has an attrbute of ESTs number whch ndcates the splcng frequency. We dvde the 3186 cassette exons n D1 nto three classes accordng to the number of ESTs. They are L1 exon, L2 exon and L3 exon, and the correspondng ESTs are 1, 2 to 8 and more than 8. The proportons of the three classes n D1 are 34.62%, 30.37% and 35.34% respectvely. Base on the 16-dmensonanl feature vector, we add three experments usng the SVM and RBF kernel. The negatve class s the consttutve exons. L1 exon, L2 exon, and L3 exon are postve classes respectvely n the Exp1, Exp2 and Exp3. In the process of tranng, the rato of postve and negatve sample s 1:1. And the rato of tranng and testng datasets s 4:1 approxmately. By grd-searchng, we got the same parameters C= , = from the three experments. The remanng ndependent testng data are predcted by the correspondng model. The results are shown n table 4. Table 4 The results of three experments Experments Sn Sp TA Exp Exp Exp From the results, t can be obvously found that the predcton accuracy s decreased gradually wth the ncrease of the splcng frequency of the cassette exons. The tradtonal method ddn t classfy the exons and only take t as a sngle class. Ths paper argues that t s one of the reason that why the predcton accuracy s always on the low level. In ths paper, we present a method of dentfyng cassette exons based on SVM. Accordng to consder the splcng frequency, the method can acheve the accuracy of 76.54% n Exp1. The result s hgher than [6] and [7] about 12%. A recent study shows that the orgn of cassette exons nvolves the emergence of cassette exons followng exonzaton of ntronc sequences [16]. The evolutonary forces sequence change from alternatve splcng to consttutve wth the ncrease of splcng frequency. Our work also confrms the theory from another perspectve.

8 64 The Identfcaton of Human Cassette Exons based on SVM References [1] W. R. T. Yang, and Q. Z L, Predcton of Alternatve 5 /3 Splce Stes n the Human Genome, BoMedcal Engneerng and Informatcs,2008. vol. 1, pp [2] C. Ma, F. Y. Deng, H. Lu, and Y. H. Zhou, Accurate predcton of alternatvely splced cassette exons usng evolutonary conservaton nformaton and logtlnear model, Bonformatcs, 2009, pp [3] G. Dror, R. Sorek, and R. Shamr, Accurate dentfcaton of alternatvely splced exons usng support vector machne, Bonformatcs, 2005, vol. 21, pp [4] R. Snha, M. Hller, R. Pudmat, U. Gausmann, M. Platzer, and R. Backofen, Improved dentfcaton of conserved cassette exons usng Bayesan networks, BMC Bonformatcs, 2008, 9:477. [5] Y. W. Chu, F. R. Hsu, and M. K. Shan, Comparatve Analyss of Exon Skppng Patterns n Human and Mouse, Database and Expert Systems Applcatons, 2006, pp [6] W. R. T. Yang, Predcton of alternatve splce ste and exon skppng based on sequence nformaton, Inner Mongola: Engneerng College of Inner Mongola Unversty, 2008, pp [7] G. Su, Y. F. Sun, J. L, The dentfcaton of human cryptc exons based on SVM, Bonformatcs and Bomedcal Engneerng, 2009, pp, 1-4. [8] Y. Q. Xng, L. R. Zhang, L. F. Luo, Predcton of alternatve splcng stes of cassette exons and ntron retenton n human genome, ACTA BIOPHYSICA SINICA, 2008, vol. 24, pp [9] S. Stamm, J. J. Rethoven, L. Texer, et al, ASD: a bonformatcs resource on alternatve splcng, Nuclec Acds Res, 2006, pp [10] C. C. Chang, C. J. Ln, LIBSVM: a lbrary for support vector machnes, 2003, unpublshed. [11] C. Yan, Z. Z. Wang, Research on sgnal sequences analyss and related characters of gene splcng, Graduate School of Natonal Unversty, 2006, pp [12] G. E. Crooks, G. Hon, J. M. Chandona, et al, Weblogo: a sequence log generator, Genome Res, 2004, pp [13] C. Zhang, M. L. Hastngs, A. R. Kraner, and M. Q. Zhang, Dual-specfcty splce stes functon alternatvely as 5 and 3 splce stes, Proc. Natl. Acad. Sc. USA, vol. 104, 2007, pp [14] S. H. Zhao, J. Km, and S. Heber, Analyss of cs-regulatory Motfs n cassette exons by ncorporatng exon skppng rates, ISBRA, 2009, pp [15] S. Q. Song, X. P. Chen, Comparatve component analyss of exons wth dfferent splcng frequences, PLoS ONE 4(4): e5387. [16] E. Km, A. Goren, and G. Ast, Alternatve splcng: current perspectves, BoEssays, 2008, vol. 30, pp