Construction of a De Bruijn Graph for Assembly from a Truncated Suffix Tree
Bastien Cazaux, Thierry Lecroq, Eric Rivals
LIRMM & IBC, Montpellier - LITIS Rouen
March 3, 2015
Introduction: De Bruijn Graph for assembly
R = {c, cc, c, cc, cc}
Cazaux, Lecroq, Rivals    Truncated Suffix Tree & DBG    1 / 30
Generalized Suffix Tree (GST)
R = {c, cc, c, cc, cc}
[Figure: generalized suffix tree of R]
Generalized Suffix Tree with cut
R = {c, cc, c, cc, cc}
[Figure: generalized suffix tree of R with the cut marked]
Truncated Suffix Tree (TST)
R = {c, cc, c, cc, cc}
[Figure: truncated suffix tree of R]
Motivation
The De Bruijn Graph is widely used in de novo genome assembly. [Pevzner et al., 2001]
For some applications, for instance error correction, one builds a suffix tree before the assembly. [Salmela, 2010]
There exist algorithms that directly build the De Bruijn Graph [Onodera et al., 2013] [Rodland, 2013] and the Contracted De Bruijn Graph [Cazaux et al., 2014] [Chikhi et al., 2014].
Indexing data structures
Numerous data structures: suffix tree, affix tree, suffix table, factor automata, etc.
- used to index one or several texts (generalized index)
- functionally equivalent
Result: We can directly build the assembly De Bruijn graph, in its classical or contracted form, from an indexing data structure. [Cazaux et al., 2014]
Question: How to do it without using more space than necessary?
1 Chain of suffix-dependent strings and Tree
2 Truncated Suffix Tree (TST)
3 De Bruijn Graph via the TST
4 Conclusion
Chain of suffix-dependent strings and Tree
String
Definition [Gusfield 1997]
Let w be a string.
- A substring of w is a string included in w,
- a prefix of w is a substring which begins w, and
- a suffix is a substring which ends w.
- An overlap between w and v is a suffix of w which is also a prefix of v.
[Figure: strings w, v and u]
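The definitions above translate almost word for word into code. The following is a minimal sketch; the function names and the sample words are illustrative assumptions, not taken from the slides, and only non-empty overlaps are listed.

```python
def is_substring(u: str, w: str) -> bool:
    """u is a substring of w if it is included in w."""
    return u in w


def is_prefix(u: str, w: str) -> bool:
    """u is a prefix of w if it begins w."""
    return w.startswith(u)


def is_suffix(u: str, w: str) -> bool:
    """u is a suffix of w if it ends w."""
    return w.endswith(u)


def overlaps(w: str, v: str) -> list[str]:
    """All non-empty overlaps between w and v, shortest first.

    An overlap is a suffix of w that is also a prefix of v.
    """
    result = []
    for length in range(1, min(len(w), len(v)) + 1):
        candidate = w[len(w) - length:]  # suffix of w of the given length
        if v.startswith(candidate):
            result.append(candidate)
    return result


# Hypothetical example words (not from the slides).
print(overlaps("abcab", "cabd"))
```

The naive scan is quadratic in the worst case; it is only meant to make the definition concrete.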
Norm of a set of words
R = {c, cc, c, cc, cc}
$\|R\| = \sum_{w_i \in R} |w_i|$
$\|R\| = 7 + 5 + 6 + 7 + 6 = 31$
Suffix Tree
[Figure: suffix tree of an example word terminated by $]
Theorem
The GST of a set of words R takes linear space in $\|R\|$.
Chain of suffix-dependent strings
Definition
A string x is said to be suffix-dependent of another string y if $x[2..|x|]$ is a prefix of y.
Let w be a string and m be a positive integer with $m \le |w|$. An m-tuple of strings $(x_1, \ldots, x_m)$ is a chain of suffix-dependent strings of w if $x_1$ is a prefix of w and, for each $i \in [2, m]$, $x_i$ is a prefix of $w[i, |w|]$ such that $|x_i| \ge |x_{i-1}| - 1$.
[Figure: a word w and a chain $x_1, \ldots, x_4$ of suffix-dependent strings]
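A small checker can make the two conditions of this definition concrete. This is a sketch under stated assumptions: the bound on m is read as m ≤ |w|, positions are 1-indexed as on the slide, and the function names and sample word are hypothetical.

```python
def is_suffix_dependent(x: str, y: str) -> bool:
    """x is suffix-dependent of y if x without its first letter is a prefix of y."""
    return y.startswith(x[1:])


def is_chain(w: str, xs: tuple[str, ...]) -> bool:
    """Check whether xs = (x_1, ..., x_m) is a chain of suffix-dependent strings of w.

    Assumption: m is read as a positive integer with m <= |w|.
    """
    m = len(xs)
    if not 1 <= m <= len(w):
        return False
    if not w.startswith(xs[0]):  # x_1 must be a prefix of w
        return False
    for i in range(2, m + 1):  # 1-indexed position i in [2, m]
        x_i, x_prev = xs[i - 1], xs[i - 2]
        if not w[i - 1:].startswith(x_i):  # x_i is a prefix of w[i, |w|]
            return False
        if len(x_i) < len(x_prev) - 1:  # |x_i| >= |x_{i-1}| - 1
            return False
    return True


# Hypothetical word: the tuple of all its suffixes is a chain.
w = "abcab"
all_suffixes = tuple(w[i:] for i in range(len(w)))
print(is_chain(w, all_suffixes))
```

When `is_chain` holds, each string of the tuple is suffix-dependent of the next one, which is what gives the chain its name.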
T(S) tree
Definition
Let $R = \{w_1, \ldots, w_n\}$ be a set of strings and $S = \{C_1, \ldots, C_n\}$ a set of tuples such that for $i \in [1, n]$, $C_i$ is a chain of suffix-dependent strings of $w_i$.
T(S) is the tree of the contracted Aho-Corasick tree of S.
[Figure: a word w, its chain $x_1, \ldots, x_4$, and the corresponding tree with nodes numbered 1 to 4]
Linear construction of T(S)
Theorem
For a set S of chains of suffix-dependent strings of a set of strings R, we can construct T(S) in $O(\|R\|)$ time and space.
Application to well-known structures
Example
Let $R = \{w_1, \ldots, w_n\}$ be a set of strings and $S = \{C_1, \ldots, C_n\}$ a set of tuples such that for $i \in [1, n]$, $C_i$ is a chain of suffix-dependent strings of $w_i$.
- For n = 1 and $S = \{C_1\}$ with $C_1$ the tuple of suffixes of $w_1$, T(S) is the Contracted Suffix Tree of R.
- For $C_i$ the tuple of suffixes of $w_i$ for all $i \in [1, n]$, T(S) is the Generalised Contracted Suffix Tree of R.
- We can construct the Truncated Suffix Tree of [Peng et al., 2003].
- We can construct the Generalised Truncated Suffix Tree of [Schulz et al., 2008].
Truncated Suffix Tree (TST)
Our Truncated Suffix Tree
Definitions
For a set of words $R = \{w_1, w_2, \ldots, w_n\}$ and an integer k > 0, we define the following notation.
1 $F_k(R)$ is the set of substrings of length k of words of R.
2 $Suff_k(R)$ is the set of suffixes of length k of words of R.
3 For all $i \in [1, |R|]$ and $j \in [1, |w_i| - k + 1]$, $A_{k,i}$ denotes the tuple such that its j-th element is defined by
$$A_{k,i}[j] := \begin{cases} w_i[j, j+k] & \text{if } j \le |w_i| - k \\ w_i[j, |w_i|] & \text{otherwise.} \end{cases}$$
4 And finally $A_k$ is the set of these tuples: $A_k := \bigcup_{i=1}^{n} A_{k,i}$.
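To make item 3 concrete, here is a minimal sketch that builds the tuple for one word and the set of tuples for a set of words. It assumes the slide's 1-indexed, inclusive interval notation, so that w[j, j+k] has k + 1 letters; the function names and the sample words are hypothetical.

```python
def a_tuple(w: str, k: int) -> tuple[str, ...]:
    """Tuple A_{k,i} for a single word w, following the case definition.

    Positions j are 1-indexed and intervals are inclusive, as on the slide.
    """
    n = len(w)
    elements = []
    for j in range(1, n - k + 2):  # j in [1, |w| - k + 1]
        if j <= n - k:
            elements.append(w[j - 1:j + k])  # w[j, j + k]: k + 1 letters
        else:
            elements.append(w[j - 1:])  # w[j, |w|]: the suffix of length k
    return tuple(elements)


def a_set(words: list[str], k: int) -> set[tuple[str, ...]]:
    """A_k: the set of tuples A_{k,i}, one per word of R."""
    return {a_tuple(w, k) for w in words}


# Hypothetical word, k = 2: three 3-mers followed by the suffix of length 2.
print(a_tuple("abcab", 2))
```

Every element but the last has length k + 1, and the last one is the length-k suffix, so a word of length at least k contributes |w| - k + 1 strings.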
Example of TST
Proposition
1 $A_{k,i}$ is a chain of suffix-dependent strings of $w_i$.
2 Moreover, $\{w \in A_{k,i} \mid A_{k,i} \in A_k\} = F_{k+1}(R) \cup Suff_k(R)$.
For R = {}, we have A_4 = {(,,,,,, )}.
[Figure: a word w and the seven strings $x_1, \ldots, x_7$ of its chain]
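Both points of the proposition can be checked by brute force on a small input. The sketch below assumes the definitions of F_k, Suff_k and A_{k,i} given earlier in the talk, with each tuple written as a sliding window of width k + 1 whose last window is cut to length k; the helper names and sample words are hypothetical.

```python
def factors(words: list[str], k: int) -> set[str]:
    """F_k(R): substrings of length k of the words of R."""
    return {w[i:i + k] for w in words for i in range(len(w) - k + 1)}


def suffixes_k(words: list[str], k: int) -> set[str]:
    """Suff_k(R): suffixes of length k of the words of R."""
    return {w[-k:] for w in words if len(w) >= k}


def a_tuple(w: str, k: int) -> tuple[str, ...]:
    """A_{k,i}: windows of k + 1 letters; the last window is the length-k suffix."""
    return tuple(w[j:j + k + 1] for j in range(len(w) - k + 1))


def check_proposition(words: list[str], k: int) -> bool:
    """Brute-force check of both points of the proposition."""
    for w in words:
        xs = a_tuple(w, k)
        for prev, cur in zip(xs, xs[1:]):
            # Point 1: consecutive strings are suffix-dependent and lose
            # at most one letter in length.
            if not cur.startswith(prev[1:]) or len(cur) < len(prev) - 1:
                return False
    # Point 2: the strings of all tuples are exactly F_{k+1}(R) plus Suff_k(R).
    elements = {x for w in words for x in a_tuple(w, k)}
    return elements == factors(words, k + 1) | suffixes_k(words, k)


# Hypothetical set of words, k = 2.
print(check_proposition(["abcab", "bcabc"], 2))
```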
Linear construction of $T(A_k)$
Corollary
We can construct $T(A_k)$ in $O(\|R\|)$ time and space.
For R = {}, we have A_4 = {(,,,,,, )}.
[Figure: the tree $T(A_4)$, with nodes numbered 1 to 7]
Example of Truncated Suffix Tree
R = {c, cc, c, cc, cc}
[Figure: truncated suffix tree of R]
De Bruijn Graph via the TST
Example of construction: De Bruijn Graph DBG_2
Truncated Suffix Tree (TST)
[Figure: truncated suffix tree of R]
R = {c, cc, c, cc, cc} and k = 2
Truncated Suffix Tree (TST): node types
[Figure: truncated suffix tree of R with an initial exact node, a sub-initial node and an initial node highlighted]
R = {c, cc, c, cc, cc} and k = 2
Nodes of the De Bruijn Graph
Notation: Init(R)
Let Init(R) denote the set of initial nodes of the TST of R.
Property: node correspondence
The set of k-mers of DBG_k of R is isomorphic to Init(R).
Arcs of the De Bruijn Graph
Idea
1 Take an initial node v
2 follow its suffix link to node z (lose the first letter of its k-mer)
3 if needed, go to the children of z to find its extensions
4 check whether the extensions are valid
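The four steps can be mimicked without any tree, which may help to see what each one is for. The sketch below is a naive set-based stand-in, not the TST algorithm of the talk. It assumes the usual assembly definition of DBG_k: the nodes are the k-mers of R, and there is an arc from u to v when some (k+1)-mer of R starts with u and ends with v. The function names and the sample word are hypothetical.

```python
def kmers(words: list[str], k: int) -> set[str]:
    """All substrings of length k of the words."""
    return {w[i:i + k] for w in words for i in range(len(w) - k + 1)}


def de_bruijn_graph(words: list[str], k: int) -> dict[str, list[str]]:
    """Adjacency lists of DBG_k under the assumed assembly definition."""
    nodes = kmers(words, k)
    valid = kmers(words, k + 1)  # (k+1)-mers that witness an arc
    alphabet = sorted({c for w in words for c in w})
    arcs = {v: [] for v in sorted(nodes)}
    for v in nodes:  # step 1: take a k-mer v
        z = v[1:]  # step 2: lose the first letter
        for c in alphabet:  # step 3: try every one-letter extension of z
            if v + c in valid:  # step 4: keep it only if v + c occurs in R
                arcs[v].append(z + c)
    return arcs


# Hypothetical single read, k = 2.
print(de_bruijn_graph(["abcab"], 2))
```

Hashing every k-mer and (k+1)-mer costs space proportional to k times the number of distinct k-mers, which is the kind of overhead the tree-based construction aims to avoid.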
Let v be an initial node, u its father, and z the node pointed at by the suffix link of v.
[Figure: nodes u, SL(u), v and z = SL(v)]
Kinship property of suffix links in suffix trees
Let v be a node of a suffix tree. If it exists, the suffix link of v belongs to the sub-tree of the suffix link of p(v), where p(v) denotes the father of v.
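The kinship property can be observed on a toy suffix tree in which every node is identified by the string it spells. The following is a naive, quadratic sketch for illustration only: it assumes a terminator `$` that does not occur in the word, and all names are hypothetical. A node lies in the sub-tree of another exactly when the latter's label is a prefix of the former's.

```python
def suffix_tree_nodes(w: str) -> set[str]:
    """Labels of the explicit nodes of the suffix tree of w + '$'."""
    t = w + "$"
    n = len(t)
    nodes = {t[i:] for i in range(n)}  # leaves: the suffixes
    nodes.add("")  # root
    for s in {t[i:j] for i in range(n) for j in range(i + 1, n)}:
        # s is an internal node if at least two different letters follow it
        following = {t[i + len(s)] for i in range(n - len(s)) if t.startswith(s, i)}
        if len(following) >= 2:
            nodes.add(s)
    return nodes


def parent(v: str, nodes: set[str]) -> str:
    """Label of the father of v: its longest proper prefix that is a node."""
    for length in range(len(v) - 1, -1, -1):
        if v[:length] in nodes:
            return v[:length]
    raise ValueError("the root has no father")


def kinship_holds(w: str) -> bool:
    """Check that SL(v) lies in the sub-tree of SL(p(v)) for every non-root node v."""
    nodes = suffix_tree_nodes(w)
    for v in nodes - {""}:
        sl_v = v[1:]  # suffix link: drop the first letter
        sl_p = parent(v, nodes)[1:]  # the root is mapped to itself here
        if sl_v not in nodes or not sl_v.startswith(sl_p):
            return False
    return True


print(kinship_holds("abcab"))
```

Since the father's label is a prefix of the node's label, dropping the first letter of both preserves the prefix relation, which is why the check succeeds.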
Example of construction of arcs of DBG_k
v is initial exact with several children
DBG construction
Theorem
Given the TST of a set of words R, the construction of the De Bruijn Graph takes linear time in $\|R\|$.
Proof
All different cases of the typology are processed in constant time.
DBG_2 of R embedded in the TST of R
[Figure: truncated suffix tree of R with the arcs of DBG_2 drawn on its nodes]
R = {c, cc, c, cc, cc} and k = 2
Linear space construction
Theorem
Given the TST of a set of words R, the construction of the De Bruijn Graph takes linear space in the size of the De Bruijn Graph.
Proof
The size of the TST is linear in the size of the De Bruijn Graph of the same order.
Conclusion
An algorithm that builds the De Bruijn Graph from a Truncated Suffix Tree in linear time in the size of the input and in linear space in the size of the output.
Funding and acknowledgments
Thanks for your attention
Questions?