Discovering Transcription Factor Binding Motif Sequences


Department of Biology, Stanford University, CA

Introduction

In biology, sequence motifs are short sequence patterns, usually of fixed length, that represent many features of DNA, RNA, and protein molecules. Sequence motifs can represent transcription factor binding sites in DNA, splice junctions in RNA, and binding domains in proteins. Thus, discovering sequence motifs can lead to a better understanding of transcriptional regulation, mRNA splicing, and the formation of protein complexes. Furthermore, protein motifs can represent the active sites of enzymes or regions involved in protein structure and stability.

Motif discovery is an important computational problem because it reveals patterns in biological sequences that help explain the structure and function of the molecules the sequences represent. In particular, identifying regulatory elements, especially the binding sites in DNA for transcription factors, is important for understanding the mechanisms that regulate gene expression. These DNA motif patterns are usually fairly short (5 to 20 base pairs long) and are known to recur in different genes or several times within a gene [1]. A DNA sequence can have zero, one, or multiple copies of a motif. In addition to these more common forms of DNA motifs, there are also palindromic motifs (a subsequence that is exactly the same as its own reverse complement) and gapped motifs (two smaller conserved sites separated by a gap) [2]. The high diversity and variability of motifs make them very difficult to identify.

A large number of algorithms for finding DNA motifs have been developed. These algorithms mostly detect overrepresented motifs and conserved motifs that might be good candidates for transcription factor binding sites. Algorithms that detect overrepresented motifs deduce motifs by considering the regulatory regions (promoters) of several co-regulated or co-expressed genes. Co-regulated genes are known to share some similarities in their regulatory mechanisms, possibly at the transcriptional level, so their promoter regions might contain some common motifs that are binding sites for transcription factors. Thus, the way to detect these regulatory elements is to search for statistically overrepresented motifs in the promoter regions of such a set of co-expressed genes. However, algorithms that detect overrepresented motifs do not perform as well in higher organisms. To overcome this, some algorithms consider conserved motifs from orthologous species. Since selective pressure causes functional sequences to evolve more slowly than non-functional sequences, well-conserved sites represent possible candidates for DNA motifs. Recent algorithms have also combined the two approaches to improve motif finding. In this report, we will review a few of the major recent developments in DNA motif finding algorithms.
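
As a small illustration of the palindromic motifs mentioned in the introduction, the following sketch checks whether a site equals its own reverse complement. The function names and example sites are hypothetical and chosen only for illustration; the code assumes uppercase DNA strings over A, C, G, T.

```python
# Illustrative helpers; names and example sites are assumptions, not from the report.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def is_palindromic(site):
    """A site is palindromic if it reads the same as its reverse complement."""
    return site == reverse_complement(site)


print(is_palindromic("GAATTC"))  # True: reverse complement is GAATTC
print(is_palindromic("TATAAA"))  # False: reverse complement is TTTATA
```

Because such a site looks identical on both strands, a motif finder that scans both strands will report it twice unless it accounts for this.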

General Techniques for Motif Discovery

Most motif finding algorithms fall into two major groups based on the combinatorial approach used: (1) word-based (string-based) methods, represented by regular expressions (RE), or (2) probabilistic sequence models based on position weight matrices (PWM) [3]. The two methods have their own strengths and weaknesses.

The word-based method relies on exhaustively counting and comparing oligonucleotide frequencies, using regular expressions [4]. Regular expressions, often used in computer science, provide a concise and flexible means of matching strings of text, which in this case are DNA sequence patterns. For example, a possible regular expression may be: T-A-C-N(2,4)-G-T-A. This RE matches any DNA sequence that begins with TAC and ends with GTA, with a gap of two to four arbitrary bases in between. The word-based method searches all possible regular expressions exhaustively to identify the RE whose matches are most over-represented. The advantage of the word-based method is that it guarantees a global optimum, since it does an exhaustive search. However, this also means that it is only suitable for short motifs. Also, although this method can be fast when implemented with suitable data structures, it is usually computationally expensive. This method is a good choice for finding motifs where all instances are identical. However, for typical transcription factor motifs, which often have several weakly constrained positions, the word-based method can suffer [5].

The probabilistic approach involves representing the motif with a position weight matrix (PWM) [6]. A PWM defines the probability of each letter in the alphabet occurring at a specific position with an n by m matrix, where n is the number of letters in the alphabet (four for DNA) and m is the number of positions in the motif. The entry in row i and column j of the matrix is the probability of letter i occurring at position j in the motif, represented by:

$$P_{i,j}, \quad 1 \le i \le n, \quad 1 \le j \le m$$

This model assumes that each position in the motif is statistically independent of the others.
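
The two representations described above can be compared on a toy example. The sketch below is a minimal illustration, not any published tool: it matches the regular expression T-A-C-N(2,4)-G-T-A with Python's `re` module, and separately estimates a PWM from a handful of aligned sites using raw column frequencies. The example sequence, the aligned sites, the function names, and the choice to omit pseudocounts are all assumptions made for illustration.

```python
import re

ALPHABET = "ACGT"

# T-A-C-N(2,4)-G-T-A written as a Python regular expression (N = any base).
MOTIF_RE = re.compile(r"TAC[ACGT]{2,4}GTA")


def find_re_sites(seq):
    """Return (start, matched_text) for each non-overlapping RE match."""
    return [(m.start(), m.group()) for m in MOTIF_RE.finditer(seq)]


def build_pwm(sites):
    """Estimate P[letter][j] as the frequency of each letter in column j.

    Assumes all sites have the same length; no pseudocounts are added,
    so a letter never seen in a column gets probability 0.
    """
    m = len(sites[0])
    pwm = {a: [0.0] * m for a in ALPHABET}
    for site in sites:
        for j, a in enumerate(site):
            pwm[a][j] += 1.0 / len(sites)
    return pwm


def site_probability(pwm, site):
    """Probability of a site under the PWM, treating positions as independent."""
    prob = 1.0
    for j, a in enumerate(site):
        prob *= pwm[a][j]
    return prob


print(find_re_sites("CCTACAAGTACCTTTACGGGGGTAAA"))
# [(2, 'TACAAGTA'), (14, 'TACGGGGGTA')]

pwm = build_pwm(["TACGTA", "TACGTA", "TACGAA", "TTCGTA"])
print(site_probability(pwm, "TACGTA"))  # 0.5625
print(site_probability(pwm, "TTCGAA"))  # 0.0625
```

The regular expression gives a yes-or-no answer for each candidate, while the PWM assigns the common site TACGTA a higher probability than the rarer variant TTCGAA.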
Thus, the probability of a sequence is just the product of the corresponding entries in the PWM. For example, the probability of the sequence TACGTA is just:

$$\Pr(\text{TACGTA}) = P_{T,1}\, P_{A,2}\, P_{C,3}\, P_{G,4}\, P_{T,5}\, P_{A,6}$$

The probabilistic approach searches the space of PWMs for motifs that maximize an objective function that is usually given by some sort of log-likelihood ratio (LLR):

$$\mathrm{LLR}(\mathrm{PWM}) = \sum_{j=1}^{m} \sum_{i=1}^{n} P_{i,j} \log_2 \frac{P_{i,j}}{f_i}$$

where $f_i$ is the overall probability of letter i in the sequences to be scanned for occurrences of the motif. The advantage of probabilistic approaches is that, compared with word-based methods, each letter can match a particular motif position to varying degrees, rather than just match or not match. Many of the algorithms developed from probabilistic approaches are designed to find longer or more general motifs. However, these algorithms are not guaranteed to find globally optimal solutions, unlike word-based methods, since they employ some form of local search (such as Gibbs sampling, expectation maximization, or greedy algorithms) that may not converge to the globally optimal solution.

Figure 1: The relationship between the motif sites, an RE, and a PWM. Nine example motifs are shown, with the corresponding RE and PWM, as well as the LOGO representation of the motif (Bailey, 2008).

Some literature also categorizes motif finding algorithms into four classes based on the input sequences: (1) a focused approach: assemble a small set of sequences and search for over-represented patterns in the sequences; (2) a related, focused discriminative approach: assemble two sets of sequences and look for patterns relatively over-represented in one of the input sets [7]; (3) a phylogenetic approach using sequence conservation information about the sequences in a single input set [8]; and (4) a whole-genome approach looking for over-represented, conserved patterns in multiple alignments of the genomes of two or more species [9]. In this report, the focused approach of co-regulated genes will be emphasized.

Word-based Algorithms

The motif finding algorithm Oligo-Analysis developed by van Helden et al. is based on the word-based approach [4]. This algorithm is conceptually very simple. It detects statistically significant motifs by counting the number of occurrences of each word or dyad and comparing these with expectation. Thus, this algorithm is exhaustive. However, it is limited to detecting only relatively simple patterns, namely short motifs with highly conserved sequences. Later, van Helden extended this method to include spaced, gapped motifs (Dyad-Analysis) [10]. The major disadvantage of these algorithms is that no variations are allowed within an oligonucleotide.

Tompa et al. developed another algorithm with an exact word-based method to find short motifs in DNA sequences [11]. The algorithm takes into account both the absolute number of occurrences and the background distribution, and creates a table that, for each length-k sequence s, records the number $N_s$ of sequences containing an occurrence of s, where an occurrence allows for a small, fixed number c of substituted residues in s. A reasonable measure of s as a motif would then be based on how likely it is to have $N_s$ occurrences if the sequences were drawn at random according to the background distribution. Now, let X be a single random sequence of the specified length L, with residues drawn randomly and independently from the background distribution. Suppose that $p_s$ is the probability that X contains at least one occurrence of the length-k sequence s, allowing for c substitutions. We assume that the N random sequences X of length L are independent. Thus, the expected number of sequences containing at least one occurrence of s among the N random sequences is:

$$N p_s$$

The standard deviation is:

$$\sqrt{N p_s (1 - p_s)}$$

Thus, the z-score is:

$$M_s = \frac{N_s - N p_s}{\sqrt{N p_s (1 - p_s)}}$$

The algorithm uses an exhaustive search to find motifs with the greatest z-scores. Tompa et al. built upon this to develop the algorithm YMF (Yeast Motif Finder) to produce the motifs with the greatest z-scores. This approach allowed variations within the oligonucleotide and made more accurate predictions.

Brazma et al. also used a word-based approach to develop a motif finding algorithm that looks for occurrences of regular expression-type patterns [12]. Many other algorithms similar to these approaches were also developed, combining word-based methods with graph-theoretic methods. For example, Sagot et al. introduced a word-based approach for motif finding that is based on representing a set of sequences with a suffix tree [13]. These implementations increased the computational efficiency of the algorithms, but have other drawbacks, such as more constraints on the target motifs to be searched.

Probabilistic Algorithms

Most probabilistic motif finding algorithms apply statistical techniques such as expectation maximization (EM) and Gibbs sampling. Gibbs sampling algorithms are used more extensively among the probabilistic approaches.

The EM algorithm iteratively maximizes the expected log-likelihood using the position weight matrix (PWM). The MEME algorithm developed by Bailey and Elkan is probably the most popular EM-based algorithm for identifying motifs in unaligned sequences [14, 15]. MEME incorporated three ideas for discovering motifs: (1) subsequences that actually occur in the sequences are used as starting points for the EM algorithm, to increase the probability of finding globally optimal motifs; (2) the assumption that each sequence contains exactly one occurrence of the shared motif is removed; and (3) a method for probabilistically erasing shared motifs after they are found is incorporated, so that several distinct motifs can be found in the same set of sequences. EM is a local optimization method, akin to gradient descent, so it cannot guarantee a global optimum. In exchange, the algorithm converges in a predictable, relatively small number of iterations.

The Gibbs sampling method for motif finding was originally developed by Lawrence et al. [16]. The Gibbs sampler is a Markov chain Monte Carlo (MCMC) approach for obtaining a sequence of random samples from a multivariate probability distribution. Markov chain means that the result of every step depends only on the result of the preceding one, as in EM.
Monte Carlo means that the way to select the next step is not deterministic but rather based on random sampling. The MCMC approach also uses a probability matrix, and iterates until an optimal alignment is found, when the ratio of the motif probability to the background probability reaches a maximum.

More formally, we assume that we are given a set of N sequences $S_1, \ldots, S_N$, and we seek within each sequence mutually similar segments of specified width W. The algorithm proceeds by iterating two steps while maintaining two evolving data structures. The first is the pattern description, in the form of a probabilistic model of residue frequencies for each position i from 1 to W, consisting of the variables $q_{i,1}, \ldots, q_{i,4}$, indexed by the W positions and the 4 possible residues (for DNA). There is also a background description, $p_1, \ldots, p_4$, giving the frequencies with which residues occur in sites not described by the pattern. The second data structure is the alignment, which is a set of positions $a_k$, for k from 1 to N, for the common pattern within the sequences.

The algorithm proceeds through multiple iterations that execute the two steps: a predictive update step and a sampling step. For the predictive update step, one of the N sequences, z, is chosen at random. The pattern description $q_{i,j}$ and background frequencies $p_j$ are then calculated from the current positions $a_k$ in all sequences excluding z. For the sampling step, every possible segment of width W within sequence z is considered as a possible instance of the pattern. The probabilities $Q_x$ of generating each segment x according to the current pattern probabilities $q_{i,j}$ are calculated, as well as the probabilities $P_x$ of generating these segments by the background probabilities $p_j$. The weight $A_x = Q_x / P_x$ is assigned to segment x, and a random segment is selected using these weights. Its position then becomes the new $a_z$. This iterative procedure works because the more accurate the pattern description constructed in step 1, the more accurate the determination of its location in step 2, so the procedure can converge to a very accurate determination of the motif.

There are several methods based on the Gibbs sampling approach. The following are three examples: (1) Based on the Gibbs sampling approach, Roth et al. developed the motif finding algorithm AlignACE (Aligns Nucleic Acid Conserved Elements) [17]. This algorithm returns a series of motifs, as weight matrices, that are overrepresented in the input set of DNA sequences. AlignACE uses the MAP (maximum a posteriori log-likelihood) score to judge the different motifs sampled, which gauges the degree of overrepresentation. (2) Thijs et al. developed another motif finding algorithm, MotifSampler, which is also a modification of the Gibbs sampling algorithm [18]. MotifSampler uses a probability distribution to estimate the number of copies of the motif in a sequence and incorporates a higher-order Markov-chain background model. (3) In another approach using the Gibbs sampling strategy, Liu et al. developed the motif finding algorithm BioProspector [19].
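
The predictive update and sampling steps of the basic Gibbs sampler described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of Lawrence et al. or of any of the tools listed: every sequence is assumed to contain exactly one site, the pseudocount of 1.0 is an arbitrary choice to keep all probabilities positive, and the function names, the fixed iteration count, and the toy sequences are hypothetical.

```python
import random

ALPHABET = "ACGT"


def predictive_update(seqs, starts, width, z, pseudo=1.0):
    """Estimate pattern q[i][b] and background p[b] from all sequences except z."""
    q = [{b: pseudo for b in ALPHABET} for _ in range(width)]
    p = {b: pseudo for b in ALPHABET}
    for k, (seq, a) in enumerate(zip(seqs, starts)):
        if k == z:
            continue
        for pos, b in enumerate(seq):
            i = pos - a
            if i in range(width):
                q[i][b] += 1  # residue lies inside the current site
            else:
                p[b] += 1     # residue belongs to the background
    for dist in q + [p]:      # normalize counts to probabilities
        total = sum(dist.values())
        for b in dist:
            dist[b] /= total
    return q, p


def segment_weights(seq, width, q, p):
    """Return the weight A_x = Q_x / P_x for every start position x in seq."""
    weights = []
    for x in range(len(seq) - width + 1):
        a_x = 1.0
        for i, b in enumerate(seq[x:x + width]):
            a_x *= q[i][b] / p[b]
        weights.append(a_x)
    return weights


def gibbs_sample(seqs, width, iterations=200, seed=0):
    """Run the two-step iteration and return one start position per sequence."""
    rng = random.Random(seed)
    starts = [rng.randrange(len(s) - width + 1) for s in seqs]
    for _ in range(iterations):
        z = rng.randrange(len(seqs))                      # pick a sequence
        q, p = predictive_update(seqs, starts, width, z)  # step 1
        weights = segment_weights(seqs[z], width, q, p)   # step 2
        starts[z] = rng.choices(range(len(weights)), weights=weights)[0]
    return starts


toy = ["GGCTACGTAGGC", "ATTACGTATCCA", "CCGGTACGTAAT", "TACGTAGGGCCA"]
print(gibbs_sample(toy, width=6))
```

Because the new position is drawn at random in proportion to the weights rather than set to the best-scoring segment, the sampler can leave a poor alignment. For the same reason, different seeds can return different alignments, so several runs are usually compared.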
BioProspector uses the promoter regions of co-regulated genes. It also uses zeroth- to third-order Markov background models, and the significance of each motif is judged based on a motif score distribution estimated by a Monte Carlo method.

Other Algorithms and Applications

There are other algorithms that, for example, combine the word-based method and probabilistic approaches, like the MDscan algorithm. The TAMO framework runs multiple motif discovery algorithms (MEME, AlignACE, and MDscan) and combines the results [20]. Other approaches are based on other machine learning techniques, neural networks, and clustering algorithms. The advantage of algorithms based on phylogenetic footprinting over the co-regulated genes approach is that the latter requires a way of identifying co-regulated genes, whereas phylogenetic footprinting can identify motifs specific to even a single gene, as long as they are sufficiently conserved across the many orthologous sequences considered. There are also algorithms that are based on both promoter sequences of co-regulated genes and phylogenetic footprinting.

These algorithms are often packaged into user-friendly interfaces, either on web servers or in toolboxes. For example, the Toolbox of Motif Discovery (Tmod) integrates 12 widely used motif discovery programs: MDscan, BioProspector, AlignACE, Gibbs Motif Sampler, MEME, CONSENSUS, MotifRegressor, GLAM, MotifSampler, SeSiMCMC, Weeder, and YMF [21].

Discussion

So many motif finding algorithms are available that it is impossible to provide a comprehensive report. Each algorithm has its own advantages and disadvantages. The two main approaches for these motif finding algorithms are word-based methods and probabilistic approaches. The main advantages of word-based methods are that they are easy for us to visualize and for computers to search for. It is also easier to compute the statistical significance of a motif defined as a regular expression. On the other hand, PWMs allow for a more flexible description of motifs, because each letter can match a particular motif position to varying degrees rather than simply matching or not matching. The main disadvantage of PWMs for motif discovery is that they are far more difficult for computer algorithms to search for. It is, however, difficult to assess the performance of these algorithms and to compare them directly. This is because each individual tool may do better on one type of data but worse on other types of data. Also, since we still do not have a complete understanding of the biology of regulatory mechanisms, it is difficult to evaluate the accuracy of these algorithms. A few of the main algorithms described above are summarized below.

Table 1: Summary of the algorithms used for DNA motif finding

PWM-based algorithm    Web server
MEME
Gibbs
AlignACE
MotifSampler           dna/bioi/software.html
BioProspector
MDscan

RE-based algorithm     Web server
Oligo/Dyad-Analysis
YMF
Weeder

Conclusion

Transcription factors bind to DNA motifs and modulate gene expression. Thus, identifying motifs in the promoter regions of genes helps us understand the regulation of gene expression. Using computational methods for motif finding has therefore been of great interest to computer scientists and biologists. Computational methods provide a simple and efficient way to identify candidate motifs without having to do time-consuming experiments. There are myriad algorithms available for motif finding, each with its own advantages and disadvantages. Diverse approaches, including combinatorial enumeration, probabilistic modeling, mathematical programming, neural networks, and genetic algorithms, have been used. It is difficult to assess which motif finding tool is the best: since we do not have a clear understanding of the biology of regulatory mechanisms, we lack an absolute standard against which to measure the correctness of these approaches. Thus, when using motif finding tools, it is important to use a few complementary tools in combination rather than relying on a single one. Motif finding algorithms, combined with high-throughput transcriptional regulation microarray assays, will allow us to gain further understanding of transcriptional regulation and gene expression.

References

1. Rombauts, S., et al., PlantCARE, a plant cis-acting regulatory element database. Nucleic Acids Res, (1).
2. Das, M.K. and H.K. Dai, A survey of DNA motif finding algorithms. BMC Bioinformatics, Suppl 7.
3. Bailey, T.L., Discovering sequence motifs. Methods Mol Biol.
4. van Helden, J., B. Andre, and J. Collado-Vides, Extracting regulatory sites from the upstream region of yeast genes by computational analysis of oligonucleotide frequencies. J Mol Biol, (5).
5. Vilo, J., et al., Mining for putative regulatory elements in the yeast genome using gene expression data. Proc Int Conf Intell Syst Mol Biol.
6. Bucher, P., Weight matrix descriptions of four eukaryotic RNA polymerase II promoter elements derived from 502 unrelated promoter sequences. J Mol Biol, (4).
7. Sinha, S., Discriminative motifs. J Comput Biol, (3-4).
8. Moses, A.M., D.Y. Chiang, and M.B. Eisen, Phylogenetic motif detection by expectation-maximization on evolutionary mixtures. Pac Symp Biocomput, 2004.
9. Kellis, M., et al., Methods in comparative genomics: genome correspondence, gene identification and regulatory motif discovery. J Comput Biol, (2-3).
10. van Helden, J., A.F. Rios, and J. Collado-Vides, Discovering regulatory elements in non-coding sequences by analysis of spaced dyads. Nucleic Acids Res, (8).
11. Tompa, M., An exact method for finding short motifs in sequences, with application to the ribosome binding site problem. Proc Int Conf Intell Syst Mol Biol, 1999.
12. Brazma, A., et al., Predicting gene regulatory elements in silico on a genomic scale. Genome Res, (11).
13. Sagot, M.F., Spelling approximate repeated or common motifs using a suffix tree. Latin '98: Theoretical Informatics.
14. Bailey, T.L., et al., MEME: discovering and analyzing DNA and protein sequence motifs. Nucleic Acids Res, (Web Server issue).
15. Bailey, T.L., M.E. Baker, and C.P. Elkan, An artificial intelligence approach to motif discovery in protein sequences: application to steroid dehydrogenases. J Steroid Biochem Mol Biol, (1).
16. Lawrence, C.E., et al., Detecting subtle sequence signals: a Gibbs sampling strategy for multiple alignment. Science, (5131).
17. Roth, F.P., et al., Finding DNA regulatory motifs within unaligned noncoding sequences clustered by whole-genome mRNA quantitation. Nat Biotechnol, (10).
18. Thijs, G., et al., A Gibbs sampling method to detect overrepresented motifs in the upstream regions of coexpressed genes. J Comput Biol, (2).
19. Liu, X., D.L. Brutlag, and J.S. Liu, BioProspector: discovering conserved DNA motifs in upstream regulatory regions of co-expressed genes. Pac Symp Biocomput, 2001.
20. Gordon, D.B., et al., TAMO: a flexible, object-oriented framework for analyzing transcriptional regulation using DNA-sequence motifs. Bioinformatics, (14).
21. Sun, H., et al., Tmod: toolbox of motif discovery. Bioinformatics, (3).