Applying Machine Learning Strategy in Transcription Factor DNA Binding Site Prediction


Ziliang Qian, Key Laboratory of Systems Biology, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, China
Zhisong He, CAS-MPG Partner Institute of Computational Biology, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, China
Yudong Cai, Institute of Systems Biology, Shanghai University, China

1 Machine Learning Approaches

Gene transcription regulation is considered one of the major mechanisms underlying tissue-specific gene expression as well as the response to environmental change. Hence, it is important to understand the gene regulatory mechanism [Jamieson et al., 2003]. Some of the major research topics include the following: (1) which genes will be expressed under specific environmental conditions; (2) the regulators of these genes; and (3) the details of transcription regulation, including the promoter binding sites and trans-activation mechanisms. Gene transcription is a major research topic in the post-genomic era and is one of the key aspects of systems biology. Major studies cover the following: (1) discovering transcription factors systematically [Qian et al., 2006a] and (2) identifying transcription factor DNA binding sites [D'haeseleer, 2006; Fox, 1997; Qian et al., 2006b; Stormo, 2000]. In this chapter, two useful machine learning approaches, the nearest neighbor algorithm (NNA) and the support vector machine (SVM), are introduced and applied to transcription factor classification and transcription factor binding site prediction. Both can be used to solve classification problems and can be applied to our topics.

1.1 Nearest Neighbor Algorithm

The NNA is one of the simplest statistical learning approaches at present [Cai & Chou, 2006; Chou & Cai, 2006; Jia et al., 2006; Yu et al., 2006], making predictions by searching for the nearest neighbor within the training dataset. The query sample is assigned the same class as its nearest neighbor. To conduct this search, a distance between two samples must be defined. Some of the well-known distances are the Euclidean, Hamming, and Mahalanobis distances.
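As a concrete illustration, the nearest neighbor rule can be sketched in a few lines of Python. The toy feature vectors and labels below are purely illustrative; a real predictor would use the domain-composition vectors and the distance measure defined later in this chapter.

```python
import math

def euclidean(a, b):
    # Euclidean distance between two equal-length feature vectors
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nna_predict(query, samples, labels, dist=euclidean):
    # Assign the query the label of its nearest training sample
    nearest = min(range(len(samples)), key=lambda i: dist(query, samples[i]))
    return labels[nearest]

# Toy example: two well-separated classes in a 2D feature space
train = [(0.0, 0.1), (0.2, 0.0), (5.0, 5.1), (4.9, 5.3)]
labels = ["TF", "TF", "non-TF", "non-TF"]
print(nna_predict((0.1, 0.2), train, labels))  # prints "TF"
```

Swapping in a Hamming or Mahalanobis distance only requires passing a different `dist` function; the decision rule itself is unchanged.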

Chapter 13 Secure Outsourcing of DNA Databases

1.2 Support Vector Machines

SVMs [Bhasin et al., 2005; Joachims, 1999] are a set of related supervised learning methods used for classification and regression. Given a set of training examples, each marked as belonging to one of two categories, an SVM training algorithm builds a model that predicts whether a new example falls into one category or the other. Intuitively, an SVM model represents the examples as points in space, mapped so that the examples of the separate categories are divided by a clear gap that is as wide as possible. New examples are then mapped into that same space and predicted to belong to a category based on which side of the gap they fall on. Technically, an SVM constructs a hyperplane or a set of hyperplanes in a high- or infinite-dimensional space, which can be used for classification, regression, or other tasks. Good separation is achieved by the hyperplane with the largest distance to the nearest training data points of any class (the so-called functional margin), since, generally, the larger the margin, the lower the generalization error of the classifier.

2 Applying the Machine Learning Approach to Transcription Factor Prediction

A transcription factor (TF) is a protein that binds to specific DNA sequences, thereby regulating gene expression (the transcription of genetic information from DNA to mRNA). It performs this function either alone or with other proteins in a complex, by promoting (as an activator) or blocking (as a repressor) the recruitment, to specific genes, of the enzyme that performs the transcription of genetic information from DNA to RNA. Given this important role, identifying the TFs in a whole genome and classifying them into different groups is necessary for understanding the mechanism of transcriptional regulation. Generally, TFs are structurally modular, with some special domains, such as the following:

1. DNA-binding domain (DBD). It binds to specific sequences of DNA (enhancer or promoter sequences) adjacent to genes, enabling the TF to regulate those genes. DNA sequences that bind TFs are often referred to as response elements.
2. Trans-activating domain (TAD). It binds to other proteins such as transcription co-regulators. These binding sites usually act as activation functions (AFs).
3. Signal-sensing domain (SSD), e.g., a ligand-binding domain. This domain is optional. It senses external signals and transmits them to the rest of the transcription complex to regulate gene expression. Moreover, the DBD and SSD may reside on separate proteins that associate within the transcription complex to regulate gene expression.

TFs are generally grouped into four distinct classes: zinc-coordinating DNA-binding domains, basic domains, helix-turn-helix, and β-scaffold factors. Figure 1 shows the proportions of these four classes, calculated based on TRANSFAC (v7.0) (details of this database are given below).

2.1 Transcription Factor Databases

To facilitate transcription regulatory studies, biological databases have been set up to store information on transcription factors and transcription factor DNA binding sites. Here, we introduce two databases.

TRANSFAC. TRANSFAC 7.0 Public 2005 [Matys et al., 2006; Wingender et al., 1996] contains data on transcription factors, their experimentally proven binding sites, and regulated genes. Its broad compilation of binding sites allows the derivation of positional weight matrices.
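To illustrate how a compilation of experimentally proven binding sites yields a positional weight matrix, the following Python sketch derives a log-odds PWM from a handful of aligned sites. The example sites, the pseudocount of 0.25, and the uniform background are illustrative assumptions, not TRANSFAC's actual procedure.

```python
import math

def position_weight_matrix(sites, pseudocount=0.25, background=0.25):
    # Build a log-odds PWM from aligned, equal-length binding sites.
    length = len(sites[0])
    n = len(sites)
    pwm = []
    for pos in range(length):
        column = [s[pos] for s in sites]
        row = {}
        for base in "ACGT":
            # Pseudocounts avoid log(0) for unseen bases
            freq = (column.count(base) + pseudocount) / (n + 4 * pseudocount)
            row[base] = math.log2(freq / background)  # log-odds vs. uniform background
        pwm.append(row)
    return pwm

def score(pwm, seq):
    # Score a candidate site by summing the per-position log-odds
    return sum(pwm[i][base] for i, base in enumerate(seq))

# Hypothetical aligned binding sites (TATA-box-like, for illustration only)
sites = ["TATAAT", "TATAAT", "TACAAT", "TATACT"]
pwm = position_weight_matrix(sites)
print(score(pwm, "TATAAT") > score(pwm, "GGGCCC"))  # prints "True"
```

Scanning a genomic sequence then amounts to sliding a window of the motif length and reporting positions whose score exceeds a chosen threshold.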

Figure 1: Four classes of transcription factors. Transcription factors are generally grouped into four distinct classes: top-left, zinc-coordinating DNA-binding domains; top-right, basic domains; bottom-left, helix-turn-helix; bottom-right, β-scaffold factors. The proportions shown were calculated based on TRANSFAC (v7.0).

JASPAR. JASPAR is a collection of transcription factor DNA-binding preferences modeled as matrices. These matrices can be converted to position weight matrices (PWMs or PSSMs), which are often used for scanning genomic sequences.

2.2 Machine Learning Approaches for Classifying Transcription Factors

To apply the machine learning approaches mentioned above to the problem of transcription factor prediction, the first step is to transform it into one or several classification problems, that is, given a sample, to tell the type to which it should be assigned. Here, we divided the problem into two individual classification problems: determining whether a protein is a TF or a non-TF, and determining to which of the four classes a TF belongs.

2.2.1 Dataset Generation

To construct predictors based on the machine learning approaches, training datasets are necessary. After dividing the issue into two distinct problems, the next step is to construct a training dataset for each of them. First, we download the information on TFs from the TRANSFAC database [Matys et al., 2006], which also contains their classification information. As non-TF samples are likewise needed to train the predictors, we randomly select protein samples from the UniProtKB/Swiss-Prot database using the keywords membrane, secretory, antigen, transferase, and kinase to construct this negative dataset. Overall, a dataset with 1176 transcription factors and 29,295 non-transcription factors can be built using TRANSFAC v7.0 and UniProtKB/Swiss-Prot Release 49.3. We then refine the dataset as follows: (1) filter out proteins with a length over 5000 aa or under 50 aa, and those without a Swiss-Prot accession number; (2) remove redundancy, to guard against homology bias, using the programs CD-HIT and PISCES. Thus,
none of the sequences investigated has more than 25% sequence identity. Finally, a positive dataset of 84 TFs with known classification information and a negative dataset of 2167 non-transcription-factor proteins are obtained (Table 1).

Table 1: Original dataset

  Dataset                       Size
  Transcription factors (TF)
    Basic domain                  21
    Zinc-coordinating             15
    Helix-turn-helix              36
    β-scaffold                    12
    Overall                       84
  Non-TF                        2167
  Total                         2251

2.2.2 Constructing a Feature Vector for Each Sample

When applying the machine learning approach, we use a feature vector to represent each sample. This feature vector should capture the characteristics of the sample related to the target variable, that is, TF or non-TF, or the type of TF, in this context. Different characteristics yield different feature vectors for one sample, which can significantly affect the performance of the predictor; thus, the choice of a good feature vector is of immense importance. Here, we use the functional domain composition feature vector. To facilitate a feasible statistical classifier, each TF is expressed as a set of discrete numbers instead of the whole amino-acid sequence, so as to capture the core features intimately related to biological function. As TFs are classified according to their structures and functions, it is anticipated that the prediction quality will be enhanced if we can find a feasible approach to use knowledge of structural and functional domains, such as DNA-binding domain(s), oligomerization domain(s), and the trans-activating domain, to define a TF sample. This can be realized using the integrated domain and motif database InterPro [Apweiler et al., 2001] through the following steps:

1. Extract the domain information of a protein from InterPro using the Protein2ipr mapping provided. Here, we use the Protein2ipr release 12.0 of Friday, 18 November 2005 for our TF/non-TF dataset. The result covers 8151 InterPro entries with well-known structural and functional domain types.
2. With each of the 8151 functional domain patterns as a vector base, a TF sample can be represented as an 8151D (dimensional) vector as follows. If there is a hit, for example, TF P49716 contains IPR004827, which is the 1970th record of the 8151 domains, then the 1970th component of TF P49716 in the 8151D feature space is set to 1; otherwise, it is set to 0.
3. The feature vector T for a given TF can thus be explicitly formulated as

   T = (t_1, t_2, ..., t_i, ..., t_8151)^T,   (1)

where

   t_i = 1 if a hit is found, and t_i = 0 otherwise.   (2)

Defined in this way, each sample in our dataset corresponds to an 8151D vector T, with each of the 8151 functional domains as a base of the vector space. That is, rather than through the amino-acid composition or another representation, a TF is now represented in terms of its functional domain composition. In this way, function-related features, aside from the sequence-related ones, are integrated into the representation. This representation approach has proven useful beyond this particular application.

2.2.3 Constructing a Predictor Based on the Machine Learning Approach

After constructing the dataset, with all samples represented by feature vectors, it becomes possible to build a predictor based on the machine learning approaches mentioned above. Here, we use the NNA to construct our two predictors. As the distance between two samples must be defined, we use the following D to represent the distance between V_x and V_y:

   D(V_x, V_y) = 1 - (V_x · V_y) / (|V_x| |V_y|),

where V_x · V_y is the dot product of the two vectors, and |V_i| is the modulus of V_i.

2.2.4 Testing the Predictors

Three different approaches, namely, the single independent dataset test, the sub-sampling test, and the Jackknife test, are often used to evaluate the power of a predictor in statistical prediction. Among them, the Jackknife cross-validation test is considered the most objective and rigorous and has hence been used in many studies. In our research, Jackknife cross-validation tests were conducted as follows. To identify TF/non-TF, for each protein T in the dataset consisting of 74 TFs and 1558 non-TFs, we applied the first ISort classifier to predict the property (TF/non-TF) of T using all the proteins except T. The classifier succeeded if it correctly predicted the property of T. The success rates for TF/non-TF identification were given according to the following formulas:

   Success rate for TF = (Correctly predicted TFs) / (True TFs)
   Success rate for non-TF = (Correctly predicted non-TFs) / (True non-TFs)
   Overall success rate = (Correctly predicted samples) / (Total samples)

To classify TFs into the four different classes, for each protein T in the dataset of TFs, we applied the second ISort classifier to predict the classification of T using all the proteins except T. The classifier succeeded if it correctly predicted the classification of T. The success rate for TF classification
was given according to the following formulas:

   Success rate for basic domain = (Correctly predicted basic domain) / (True basic domain)
   Success rate for zinc-coordinating = (Correctly predicted zinc-coordinating) / (True zinc-coordinating)
   Success rate for helix-turn-helix = (Correctly predicted helix-turn-helix) / (True helix-turn-helix)
   Success rate for β-scaffold = (Correctly predicted β-scaffold) / (True β-scaffold)
   Overall success rate = (Correctly predicted samples) / (Total samples)

2.3 Results

Two 8151D classifiers were built according to the above process: one for identifying TFs/non-TFs and another for further classifying TFs into the four categories: basic domains, zinc-coordinating DNA-binding domains, helix-turn-helix, and β-scaffold factors. Following Steps 1-3 above, we obtained the following results. (1) For TF/non-TF identification, after excluding proteins without functional domain annotation and orphans whose domains occur only once in our original dataset, 8151D feature vectors were built for 74 TFs and 1558 non-TFs. (2) For TF classification, three more TFs were filtered out as orphans; thus, 8151D feature vectors were built for 71 TFs.

Table 2: Performance of TF/non-TF identification

   Category   Jackknife test success rate
   TF         66/74 = 89.2%
   Non-TF     1540/1558 = 98.8%
   Overall    1606/1632 = 98.4%

Table 3: Performance of TF classification

   Classification      Jackknife test success rate
   Basic domain        20/20 = 100%
   Zinc-coordinating   10/11 = 90.9%
   Helix-turn-helix    33/33 = 100%
   β-scaffold          6/7 = 85.7%
   Overall             69/71 = 97.2%

Tables 2 and 3 present the success rates of the Jackknife cross-validation test for TF/non-TF identification and TF classification, respectively. Our predictors achieved very good performance. As shown in Table 2, the success rates are 89.2% and 98.8% for TF and non-TF identification, respectively, and 98.4% overall. As shown in Table 3, the success rates reach 100%, 90.9%, 100%, and 85.7% for basic domain TFs, zinc-coordinating TFs, helix-turn-helix TFs, and β-scaffold TFs, respectively, and 97.2% overall. These results demonstrate that domain composition is a highly effective means of characterizing the features of TFs for classification.

3 Applying the Machine Learning Approach to Predict Transcription Factor Binding Sites

Predicting novel transcription factors is only the first step toward revealing the mechanism of gene expression regulation by TFs. The second issue that must be dealt with is determining the preference of a newly identified
TF for certain types of DNA sequences. Typically, in cases like this, one resorts to a vast number of experiments, such as DNA footprinting, to find direct evidence of the interaction between the TF and its target genes, or builds statistical models to describe the DNA preferences of the TFs. However, both methods are not ideal for challenging problems because they are costly and time consuming. Here, we apply the machine learning approach to this problem, similar to how we solved the TF prediction problem above, and we will show that it can likewise obtain satisfying results.

3.1 Machine Learning Approach to Solve This Problem

3.1.1 Problem Transformation

As with the TF prediction problem, the first step is to transform the problem into a classification problem that can be solved by the machine learning approach. Thus, we generate a new problem: given a combination of a TF, a transcription factor target (TFT), and a transcription factor binding site (TFBS), assign it to be positive (the given TF regulates the given TFT by binding to the given TFBS) or negative (otherwise). By searching all possible combinations for a given TF, it theoretically becomes possible to determine its TFTs and TFBSs in the whole genome by solving this new problem. Clearly, we can solve it using the machine learning approaches mentioned above.

3.1.2 Dataset Generation

For TFs and their targets and binding sites, the complete known information can be obtained from TRANSFAC (v7.0 was used here). The original dataset is filtered through the following steps: (1) 327 TFs and 113 TFTs without Swiss-Prot accessions are removed, and the 407 associated TFBSs are filtered out; (2) 743 TFBSs shorter than 5 bp or longer than 25 bp are removed, as most TFBS lengths fall within this range; (3) finally, a positive dataset with 3430 TF-TFT-TFBS triplets, covering 143 TFs, 1416 TFTs, and 571 TFBSs, is built. As mentioned above, a negative dataset is also needed. We generate it by randomly shuffling the TFBS column of the collected positive dataset according to the following steps: (1) each TF-TFT-TFBS triplet is assigned a random number; (2) the TFBSs are shuffled according to the random numbers, while the TFs and TFTs are not changed; (3) any duplicated record(s) that already exist in the positive dataset are removed; (4) Steps 1, 2, and 3 are repeated twice; (5) finally, we obtain a negative dataset with approximately 7000 records, about twice the size of the positive dataset.

3.1.3 Feature Vector Construction

The prerequisite for building a predictor is constructing a feature vector for each sample. As one sample consists of both proteins (TF, TFT) and a nucleotide sequence (TFBS), feasible numeric representation systems are necessary for proteins as well as for nucleotide sequences. For the proteins, the functional domain composition feature vector, the same system we employed to represent TFs in the TF prediction problem, can also be used here. For the TFBS, which is a nucleotide sequence, we introduce a new system to represent each one with a feature vector. Without information loss, each TFBS is encoded with a 0-1 system according to the following steps:

1. First, TFBSs with a length of <25 bp are extended to exactly 25 bp by adding N suffixes. For example, the 21 bp nucleotide sequence

   TTCGATCGATCGATCGATCGT

   is extended to

   TTCGATCGATCGATCGATCGNNNNN,

while TFBSs already exactly 25 bp long remain unchanged.

2. These TFBSs are then represented as 25 × 5 = 125D (dimensional) vectors, with the five different nucleotide symbols encoded as five orthogonal 5D binary vectors:

   A := (1, 0, 0, 0, 0)^T
   C := (0, 1, 0, 0, 0)^T
   G := (0, 0, 1, 0, 0)^T
   T := (0, 0, 0, 1, 0)^T
   N := (0, 0, 0, 0, 1)^T

3. Finally, each TFBS can be formulated as

   D = (d_1, d_2, ..., d_j, ..., d_125)^T,   (3)

where each d_j is either 0 or 1.

DNA binding preferences can be inferred by predicting the interactions among TF, TFT, and TFBS as mentioned above. Therefore, after obtaining T = (t_1, t_2, ..., t_8151)^T representing the TF, G = (g_1, g_2, ..., g_8151)^T representing the TFT, and D = (d_1, d_2, ..., d_125)^T representing the TFBS, the next step is to combine the three into a single feature vector R covering the TF-TFT-TFBS triplet. This is done as follows. Suppose T_x, G_y, and D_z are the x-th TF, the y-th TFT, and the z-th TFBS, respectively. The (x, y, z) TF-TFT-TFBS triplet R can then be expressed as

   R(x, y, z) = T_x ⊕ G_y ⊕ (k · D_z)   (4)
              = (t_(x,1), t_(x,2), ..., t_(x,8151), g_(y,1), g_(y,2), ..., g_(y,8151), k·d_(z,1), k·d_(z,2), ..., k·d_(z,125))^T   (5)
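The padding and one-hot steps above can be sketched in Python as follows. The particular assignment of A, C, G, T, N to the five unit vectors is a convention chosen for illustration; any fixed orthogonal assignment works equally well.

```python
def encode_tfbs(seq, max_len=25):
    # Pad with 'N' to max_len, then encode each base as a 5D orthogonal
    # binary vector, giving a 25 * 5 = 125D feature vector in total.
    code = {"A": 0, "C": 1, "G": 2, "T": 3, "N": 4}  # assumed ordering
    padded = seq.upper().ljust(max_len, "N")
    vec = []
    for base in padded:
        column = [0, 0, 0, 0, 0]
        column[code[base]] = 1  # exactly one component set per position
        vec.extend(column)
    return vec

# The 21 bp example from the text, padded to 25 bp with N suffixes
v = encode_tfbs("TTCGATCGATCGATCGATCGT")
print(len(v), sum(v))  # prints "125 25": 125 components, one hit per position
```

Concatenating this 125D vector (scaled by the weight k) with the two 8151D domain-composition vectors then gives the 16427D triplet vector R of Formula (5).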

In Formula (5), the operator ⊕ represents the orthogonal sum, k is a weight that helps remove the bias caused by the differing contributions of the two encoding systems, and T is the transpose operator. By using this hybridization approach, every TF-TFT-TFBS triplet can be represented as a simple 16427D (8151D + 8151D + 125D) vector. Similar approaches have been used in several previous works, such as predicting protein-protein interactions, and have achieved excellent performance. This indicates that the hybridization approach provides an efficient encoding system for the numeric representation of interacting pairs/triplets.

3.1.4 Construction and Testing of the Predictor Based on the Machine Learning Approach

The same NNA is used here to construct the predictor; see the section above for details. An SVM can also be used to construct another predictor; a polynomial kernel is used as the kernel function of our SVM classifier. To test the predictors' performance, the Jackknife cross-validation method is used in the same way as for the TF predictor. To compare our two predictors based on different machine learning approaches, a 10-fold cross-validation is also used. The 10-fold cross-validation tests employed in this study are conducted in the following steps: (1) the dataset, including both positive and negative data, is randomly split into 10 parts; (2) for each part, we predict the category of each sample using the other parts as the training data. A prediction is considered correct if the predicted category of a sample is the same as the real one.

Table 4: Performance of our predictors

                                    Our Previous Work       NNA                   SVM (polynomial kernel)
                                    Jackknife   10-fold     Jackknife   10-fold   Jackknife   10-fold
   Success rate, positive dataset   71.90%      NA          84.70%      83.0%     NA          76.40%
   Success rate, negative dataset   78.90%      NA          89.30%      89.10%    NA          88.00%
   Overall success rate             76.60%      NA          87.90%      87.0%     NA          84.10%

3.2 Results

Our dataset consists of 3430 true TF-TFT-TFBS triplets and approximately 7000 artificial triplets, with k set to 0.5 (Formula 5). For the predictor based on the NNA, the Jackknife success rates on the positive and negative datasets are 84.7% and 89.3%, respectively; the overall success rate is 87.9% (Table 4). The 10-fold cross-validation success rates on the positive and negative datasets are 83.0% and 89.1%, respectively, with an overall accuracy of 87.0% (Table 4). For the predictor based on the SVM, the 10-fold cross-validation success rates on the positive and negative datasets are 76.4% and 88.0%, respectively, with an overall accuracy of 84.1%. These results show that DNA binding preferences are closely correlated with both the function of the TF and that of its target. Comparing our present work with our previous one, which employed only the information on the TF and TFBS, the prediction performance using the TF-TFT-TFBS triplets increased as expected, because the TFTs are also considered.
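The 10-fold procedure described above, combined with the nearest neighbor rule under the distance D = 1 − (V_x · V_y)/(|V_x| |V_y|) used earlier in the chapter, can be sketched as follows. The function names and the small separable toy vectors standing in for the triplet features are illustrative only.

```python
import math
import random

def cosine_distance(a, b):
    # D(Vx, Vy) = 1 - (Vx . Vy) / (|Vx| |Vy|), as defined earlier
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

def nn_predict(query, train_x, train_y):
    # Nearest neighbor rule under the cosine-based distance
    nearest = min(range(len(train_x)), key=lambda i: cosine_distance(query, train_x[i]))
    return train_y[nearest]

def ten_fold_cv(samples, labels, predict, folds=10, seed=0):
    # Randomly split the data into `folds` parts; predict each held-out
    # sample from the remaining parts and report the overall accuracy.
    idx = list(range(len(samples)))
    random.Random(seed).shuffle(idx)
    parts = [idx[i::folds] for i in range(folds)]
    correct = 0
    for held_out in parts:
        train = [i for i in idx if i not in held_out]
        train_x = [samples[i] for i in train]
        train_y = [labels[i] for i in train]
        for i in held_out:
            if predict(samples[i], train_x, train_y) == labels[i]:
                correct += 1
    return correct / len(samples)

# Toy, well-separated vectors standing in for positive/negative triplets
positive = [(1.0, 1.0 + 0.01 * i, 0.0, 0.0) for i in range(10)]
negative = [(0.0, 0.0, 1.0, 1.0 + 0.01 * i) for i in range(10)]
samples = positive + negative
labels = ["positive"] * 10 + ["negative"] * 10
print(ten_fold_cv(samples, labels, nn_predict))  # prints "1.0" on this separable toy set
```

The Jackknife test is the limiting case of the same scheme with the number of folds equal to the number of samples, so each sample is held out on its own.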

References

[Apweiler et al., 2001] Apweiler, R., Attwood, T., Bairoch, A., Bateman, A., Birney, E., Biswas, M., Bucher, P., Cerutti, L., Corpet, F., Croning, M., et al. (2001). The InterPro database, an integrated documentation resource for protein families, domains and functional sites. Nucleic Acids Research, 29(1), 37.
[Attwood, 2002] Attwood, T. (2002). The PRINTS database: a resource for identification of protein families. Briefings in Bioinformatics, 3(3), 252.
[Bhasin et al., 2005] Bhasin, M., Zhang, H., Reinherz, E., & Reche, P. (2005). Prediction of methylated CpGs in DNA sequences using a support vector machine. FEBS Letters, 579(20).
[Cai & Chou, 2006] Cai, Y. & Chou, K. (2006). Predicting membrane protein type by functional domain composition and pseudo-amino acid composition. Journal of Theoretical Biology, 238(2).
[Chou & Cai, 2004] Chou, K. & Cai, Y. (2004). Using GO-PseAA predictor to predict enzyme sub-class. Biochemical and Biophysical Research Communications, 325(2).
[Chou et al., 2006] Chou, K., Cai, Y., et al. (2006). Predicting protein-protein interactions from sequences in a hybridization space. Journal of Proteome Research, 5(2).
[D'haeseleer, 2006] D'haeseleer, P. (2006). How does DNA sequence motif discovery work? Nature Biotechnology, 24(8).
[Finn et al., 2006] Finn, R., Mistry, J., Schuster-Böckler, B., Griffiths-Jones, S., Hollich, V., Lassmann, T., Moxon, S., Marshall, M., Khanna, A., Durbin, R., et al. (2006). Pfam: clans, web tools and services. Nucleic Acids Research, 34(Database issue), D247.
[Jamieson et al., 2003] Jamieson, A., Miller, J., & Pabo, C. (2003). Drug discovery with engineered zinc-finger proteins. Nature Reviews Drug Discovery, 2(5).
[Jia et al., 2006] Jia, P., Shi, T., Cai, Y., & Li, Y. (2006). Demonstration of two novel methods for predicting functional siRNA efficiency. BMC Bioinformatics, 7(1), 271.
[Joachims, 1999] Joachims, T. (1999). Making large-scale SVM learning practical.
[Leblanc & Moss, 2000] Leblanc, B. & Moss, T. (2000). DNase I footprinting. The Nucleic Acid Protocols Handbook.
[Matys et al., 2006] Matys, V., Kel-Margoulis, O., Fricke, E., Liebich, I., Land, S., Barre-Dirrie, A., Reuter, I., Chekmenev, D., Krull, M., Hornischer, K., et al. (2006). TRANSFAC and its module TRANSCompel: transcriptional gene regulation in eukaryotes. Nucleic Acids Research, 34(Database issue), D108.
[Qian et al., 2006a] Qian, Z., Cai, Y., & Li, Y. (2006a). A novel computational method to predict transcription factor DNA binding preference. Biochemical and Biophysical Research Communications, 348(3).
[Qian et al., 2006b] Qian, Z., Cai, Y., & Li, Y. (2006b). Automatic transcription factor classifier based on functional domain composition. Biochemical and Biophysical Research Communications, 347(1).
[Stormo, 2000] Stormo, G. (2000). DNA binding sites: representation and discovery. Bioinformatics, 16(1), 16.
[Wingender et al., 1996] Wingender, E., Dietze, P., Karas, H., & Knüppel, R. (1996). TRANSFAC: a database on transcription factors and their DNA binding sites. Nucleic Acids Research, 24(1), 238.
[Yu et al., 2006] Yu, X., Wang, C., & Li, Y. (2006). Classification of protein quaternary structure by functional domain composition. BMC Bioinformatics, 7(1), 187.