De novo sequence assembly
  2015.6.12
  徐唯哲 (Paul Wei Che HSU), Assistant Research Specialist
  Bioinformatics Service Core, Institute of Molecular Biology, Academia Sinica, Taiwan, R.O.C.
De novo sequence assembly
  Genome assembly
  Transcriptome assembly
  Metagenome assembly
Shortest common superstring (SCS)
  Given a collection of strings S, find SCS(S): the shortest string that contains every string in S as a substring.
  Example:
    S: BAA AAB BBA ABA ABB BBB AAA BAB
    Concatenation (without the requirement of being shortest): BAAAABBBAABAABBBBBAAABAB (length 24)
    SCS(S): AAABBBABAA (length 10), which contains AAA AAB ABB BBB BBA BAB ABA BAA
  (Ben Langmead, http://www.langmead-lab.org/teaching-materials/)
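The SCS definition above can be checked on the slide's example with a brute-force search. This is only a minimal sketch, not how an assembler works: it tries every ordering of the strings, so it is usable for a handful of strings at most. The function names `overlap` and `scs_brute_force` are illustrative, and the sketch assumes that no input string is a substring of another.

```python
from itertools import permutations


def overlap(a, b):
    """Length of the longest suffix of a that equals a prefix of b."""
    for n in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def scs_brute_force(strings):
    """Try every ordering, merge neighbours with maximal overlap, keep the shortest."""
    best = None
    for order in permutations(strings):
        superstring = order[0]
        for s in order[1:]:
            # Append only the part of s that is not already covered by the overlap.
            superstring += s[overlap(superstring, s):]
        if best is None or len(superstring) < len(best):
            best = superstring
    return best


if __name__ == "__main__":
    S = ["BAA", "AAB", "BBA", "ABA", "ABB", "BBB", "AAA", "BAB"]
    result = scs_brute_force(S)
    print(result, len(result))
```

The number of orderings grows factorially with the number of strings, which is why practical assemblers fall back on approximations.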
De novo genome assembly
  Unknown genome
  Shotgun sequencing: DNA is sheared into random fragments, which are sequenced as reads (or tags)
  Assembly: reconstruct the genome from the reads
De novo assembly algorithms
  String-based assemblers (greedy extension algorithm)
  Graph-based assemblers:
    Overlap-layout-consensus (OLC)
    de Bruijn graph assembly (dbg)
String-based assemblers (greedy extension algorithm)
  SSAKE (2007), SHARCGS (2007) and QSRA (2009) are applicable to the Illumina platform.
  More time-consuming; suitable for small numbers of reads (low throughput) and smaller genomes.
  The greedy algorithm is not guaranteed to choose the overlaps that yield the SCS, but it is a good approximation.
Shortest common superstring: greedy
  Greedy SCS algorithm in action (l = 1)
  Input strings: ABA ABB AAA AAB BBB BBA BAB BAA
  Rounds of merging, one merge per line. The number in the first column is the length of the overlap merged before that round; the newly merged string is listed first. (On the original slide, the strings that get merged before the next round are shown in red.)
    2  BAAB ABA ABB AAA BBB BBA BAB
    2  BABB BAAB ABA AAA BBB BBA
    2  BBAAB BABB ABA AAA BBB
    2  BBBAAB BABB ABA AAA
    2  BBBAABA BABB AAA
    2  BABBBAABA AAA
    1  BABBBAABAAA
  Greedy answer (superstring): BABBBAABAAA
  Actual SCS: AAABBBABAA
  (Ben Langmead, http://www.langmead-lab.org/teaching-materials/)
String overlap algorithm (greedy extension algorithm)
  1. Identify overlapping regions and select the highest-scoring overlap.
  2. Merge the overlapping sequences.
  3. Identify overlaps again, then merge (repeat).
  4. Stop when no sequences can be merged anymore.
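The greedy extension loop described above can be sketched in a few lines. This is an illustrative toy, not the implementation used by SSAKE, SHARCGS or QSRA: the score is simply the exact overlap length, ties are broken by input order, and the names `overlap`, `greedy_scs` and `min_overlap` are assumptions made for the example.

```python
def overlap(a, b):
    """Length of the longest suffix of a that equals a prefix of b."""
    for n in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_scs(reads, min_overlap=1):
    """Repeatedly merge the pair of strings with the longest overlap."""
    strings = list(reads)
    while len(strings) > 1:
        best_len, best_i, best_j = 0, None, None
        # Identify the highest-scoring overlap among all ordered pairs.
        for i, a in enumerate(strings):
            for j, b in enumerate(strings):
                if i != j:
                    n = overlap(a, b)
                    if n > best_len:
                        best_len, best_i, best_j = n, i, j
        if best_len < min_overlap:
            break  # nothing left that can be merged
        merged = strings[best_i] + strings[best_j][best_len:]
        strings = [s for k, s in enumerate(strings) if k not in (best_i, best_j)]
        strings.append(merged)
    return strings  # one string if everything merged, otherwise several contigs


if __name__ == "__main__":
    print(greedy_scs(["ABC", "BCD", "CDE"]))
```

Because ties are broken arbitrarily, the greedy result may be longer than the true SCS, and different tie-breaking rules can give different superstrings for the same input.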
Graph-based assemblers
  High speed; suitable for large numbers of reads (high throughput) and bigger genomes.
  Overlap-layout-consensus (OLC): Newbler (2006, 454 platform), Forge (2009, 454 + Illumina)
  de Bruijn graph assembly (dbg): Velvet (2008), CLCbio (2009), ABySS (2009) and SOAPdenovo (2010) are applicable to the Illumina platform
Overlap-layout-consensus (OLC)
  Software: Newbler (454 platform), SGA
  Overlap: 1. Finding overlaps  2. Build overlap graph
  Layout: bundle stretches of the overlap graph into contigs
  Consensus: pick the most likely nucleotide sequence for each contig
Finding overlaps: semiglobal alignment
  Find the optimal alignment between a suffix of S1 and a prefix of S2 (or between a prefix of S1 and a suffix of S2).
  Solved by dynamic programming, as in the Needleman-Wunsch algorithm.
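A minimal sketch of the suffix-prefix (semiglobal) alignment follows. It is the Needleman-Wunsch recurrence with two changes: skipping the beginning of S1 is free, and the alignment may end anywhere in S2. The scoring values (match +1, mismatch -1, gap -1) and the function name `suffix_prefix_align` are assumptions chosen for illustration, not values taken from any particular assembler.

```python
def suffix_prefix_align(s1, s2, match=1, mismatch=-1, gap=-1):
    """Best score for aligning some suffix of s1 with some prefix of s2.

    Returns (score, prefix_length), where prefix_length is the number of
    characters of s2 covered by the best alignment.
    """
    n, m = len(s1), len(s2)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    # Characters of s2 placed before any character of s1 must be paid for as gaps.
    for j in range(1, m + 1):
        D[0][j] = D[0][j - 1] + gap
    for i in range(1, n + 1):
        D[i][0] = 0  # skipping a prefix of s1 is free
        for j in range(1, m + 1):
            diag = D[i - 1][j - 1] + (match if s1[i - 1] == s2[j - 1] else mismatch)
            D[i][j] = max(diag, D[i - 1][j] + gap, D[i][j - 1] + gap)
    # The alignment must use s1 up to its last character, but may stop anywhere in s2.
    best_j = max(range(m + 1), key=lambda j: D[n][j])
    return D[n][best_j], best_j


if __name__ == "__main__":
    print(suffix_prefix_align("ACGTTGCA", "TGCAAGG"))
```

Unlike exact matching, this formulation tolerates mismatches and indels inside the overlap, at the cost of filling an n by m table for every pair of reads.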
Finding overlaps: exact string matching, L = 3
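Exact suffix-prefix matching can be sketched as below, assuming that L on the slide denotes the minimum overlap length that counts as an overlap. The function name `exact_overlap` and the parameter name `min_length` are illustrative.

```python
def exact_overlap(a, b, min_length=3):
    """Length of the longest suffix of a that exactly matches a prefix of b.

    Returns 0 if no overlap of at least min_length characters exists.
    """
    start = 0
    while True:
        # Look for the first min_length characters of b somewhere in a.
        start = a.find(b[:min_length], start)
        if start == -1:
            return 0
        # The overlap is valid only if the rest of a is also a prefix of b.
        if b.startswith(a[start:]):
            return len(a) - start
        start += 1


if __name__ == "__main__":
    print(exact_overlap("TTACGT", "CGTACCGT"))  # suffix CGT matches prefix CGT
```

Scanning from left to right means the first valid hit is also the longest overlap.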
Finding overlaps: suffix tree
Build overlap graph
  Find the overlap relationships between all reads, then draw them as a graph: each read is a node, and each pair of overlapping reads is joined by a directed edge.
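Building the overlap graph amounts to testing every ordered pair of reads. The sketch below stores the graph as a dictionary that maps a (read, read) pair to the overlap length; the three example reads, the function names and the minimum overlap of 3 are assumptions for illustration only.

```python
from itertools import permutations


def exact_overlap(a, b, min_length=3):
    """Longest suffix of a that is a prefix of b, or 0 if shorter than min_length."""
    start = 0
    while True:
        start = a.find(b[:min_length], start)
        if start == -1:
            return 0
        if b.startswith(a[start:]):
            return len(a) - start
        start += 1


def overlap_graph(reads, min_length=3):
    """Directed overlap graph: edge (a, b) is labelled with the overlap length."""
    graph = {}
    for a, b in permutations(reads, 2):
        n = exact_overlap(a, b, min_length)
        if n > 0:
            graph[(a, b)] = n
    return graph


if __name__ == "__main__":
    reads = ["ACGGATC", "GATCAAGT", "TTCACGGA"]
    for (a, b), n in overlap_graph(reads).items():
        print(a, "->", b, "overlap", n)
```

The all-pairs loop is quadratic in the number of reads, which is the main cost of the overlap step on large data sets.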
Layout
Layout: Hamiltonian path
  A Hamiltonian path is a path in a graph that visits each vertex exactly once.
  If the graph also contains an edge from the last vertex of the path back to the first, adding that edge closes the path into a Hamiltonian circuit.
  (Figure: example graph with vertices A to I.)
Layout
  Genome: to_every_thing_turn_turn_turn_there_is_a_season
  (Ben Langmead, http://www.langmead-lab.org/teaching-materials/)
Consensus
  Pick the most likely nucleotide sequence for each contig.
  Where the reads disagree: sequencing error? SNP? Insertion? Deletion?
  (Ben Langmead, http://www.langmead-lab.org/teaching-materials/)
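Limitation of OLC
  Data sets of more than a million reads cannot be resolved effectively, because the number of read pairs that must be checked for overlaps grows quadratically with the number of reads.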
De Bruijn graph assembly (dbg)
  Use k-mer sequences instead of reads: break the reads into smaller k-mer sequences.
  (Figure: the true genome, which you never know; the reads; the k-mer sequences.)
de Bruijn graph assembly (dbg)
  Velvet (2008), CLCbio (2009), ABySS (2009), SOAPdenovo (2010)
  Step 1: replace each read by all of its substrings of length k (k-mers).
  Example read (k = 3): AGATGATTCG
  All of its 3-mers: AGA GAT ATG TGA GAT ATT TTC TCG
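Step 1 is a sliding window over the read. A minimal sketch, using the read and k from the slide (the function name `kmers` is illustrative):

```python
def kmers(read, k):
    """All substrings of length k, in order of their position in the read."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


if __name__ == "__main__":
    print(kmers("AGATGATTCG", 3))
```

A read of length n yields n - k + 1 k-mers, so the 10-base read above gives eight 3-mers, with GAT occurring twice.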
de Bruijn graph assembly (dbg)
  Velvet (2008), CLCbio (2009), ABySS (2009), SOAPdenovo (2010)
  Step 2: draw the graph with (k-1)-mers as vertices and k-mers as edges; each (k-1)-mer appears only once in the graph.
  Read: AGATGATTCG
  k-mers (edges): AGA, GAT, ATG, TGA, GAT, ATT, TTC, TCG
  (k-1)-mer pairs: AG -> GA, GA -> AT, AT -> TG, TG -> GA, GA -> AT, AT -> TT, TT -> TC, TC -> CG
  Vertices: AG GA AT TG TT TC CG
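Step 2 can be sketched with an adjacency list keyed by (k-1)-mer. Here every k-mer occurrence becomes one directed edge from its left (k-1)-mer to its right (k-1)-mer, so a repeated k-mer such as GAT gives two parallel edges. Keeping parallel edges is a choice made for this sketch; real assemblers such as Velvet typically store each distinct k-mer once together with its coverage. The function name `de_bruijn_graph` is illustrative.

```python
from collections import defaultdict


def de_bruijn_graph(read, k):
    """Map each (k-1)-mer to the list of (k-1)-mers it points to, one entry per k-mer."""
    graph = defaultdict(list)
    for i in range(len(read) - k + 1):
        kmer = read[i:i + k]
        graph[kmer[:-1]].append(kmer[1:])  # edge: left (k-1)-mer -> right (k-1)-mer
    return dict(graph)


if __name__ == "__main__":
    for node, successors in de_bruijn_graph("AGATGATTCG", 3).items():
        print(node, "->", successors)
```

Each vertex appears exactly once as a key, even when the corresponding (k-1)-mer occurs several times in the read.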
de Bruijn graph assembly (dbg)
  Velvet (2008), CLCbio (2009), ABySS (2009), SOAPdenovo (2010)
  Step 3: find an Eulerian path in the directed graph, that is, a path that traverses each edge of the graph exactly once.
  Read: AGATGATTCG
  Edges in path order: AGA GAT ATG TGA GAT ATT TTC TCG
  Vertices in path order: AG GA AT TG GA AT TT TC CG
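Step 3 can be sketched with Hierholzer's algorithm, a standard way to find an Eulerian path in a directed graph. The sketch assumes that an Eulerian path exists (error-free k-mers from a single sequence) and that repeated k-mers are kept as parallel edges; the function names are illustrative and do not correspond to any assembler's API.

```python
from collections import defaultdict


def build_graph(kmer_list):
    """(k-1)-mer adjacency list with one edge per k-mer occurrence."""
    graph = defaultdict(list)
    for kmer in kmer_list:
        graph[kmer[:-1]].append(kmer[1:])
    return graph


def eulerian_path(graph):
    """Hierholzer's algorithm: visit every edge exactly once."""
    in_degree = defaultdict(int)
    for successors in graph.values():
        for v in successors:
            in_degree[v] += 1
    # Start at the node with one more outgoing than incoming edge, if there is one.
    start = next((u for u in graph if len(graph[u]) - in_degree[u] == 1),
                 next(iter(graph)))
    remaining = {u: list(vs) for u, vs in graph.items()}  # edges not yet used
    stack, path = [start], []
    while stack:
        u = stack[-1]
        if remaining.get(u):
            stack.append(remaining[u].pop())  # follow an unused edge
        else:
            path.append(stack.pop())  # dead end: node is final in the path
    return path[::-1]


def spell(path):
    """Rebuild the sequence from consecutive (k-1)-mers."""
    return path[0] + "".join(node[-1] for node in path[1:])


if __name__ == "__main__":
    kmer_list = ["AGA", "GAT", "ATG", "TGA", "GAT", "ATT", "TTC", "TCG"]
    path = eulerian_path(build_graph(kmer_list))
    print(" -> ".join(path))
    print(spell(path))
```

For this read the Eulerian path is unique, so the original sequence is recovered; with longer repeats several Eulerian paths can exist and the graph alone cannot tell which one is the genome.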
When the assembly is built from k-mer sequences, it is more efficient to use dbg than OLC (Compeau et al., 2011, Nature Biotechnology).
  (Figure: OLC versus dbg.)
Error correction
  To obtain fewer and longer contigs, most assembly programs correct errors and thereby modify the result.
dbg algorithm (Velvet software)
  Step 1: sequencing. Reads of length 7 are sampled from the genome (on the original slide, red marks a sequencing error).
  Step 2: build a lookup table of k-mers (k = 4) and link all k-mers.
dbg algorithm (Velvet software)
  Step 3: simplify the graph by combining overlapping k-mers into longer sequences. Note that the simplified graph can still contain several possible paths.
  Step 4: remove the error paths; four contigs remain.
Required conditions for a perfect dbg
  1. The k-mers cover the entire genome.
     This is hardly possible: some regions of the genome are hard to sequence (GC-rich regions or structural problems) and others are very easy, so some regions are covered by many reads while others have none.
  2. All k-mers are free of errors.
     This is impossible: so far, even the best Illumina instruments only guarantee about 80% of bases at Q30 (Q30 = one error expected per 1000 bases).
  3. Each k-mer appears only once in the genome.
     This is impossible: most genomes, cellular or viral, contain repeated sequences of varying lengths. About 45% of the human genome consists of repeated sequences.
  Reference: Human Molecular Genetics, 4th ed., 2010
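The Q30 figure above follows from the standard Phred definition, Q = -10 log10(P), where P is the probability that a base call is wrong. The sketch below converts a quality score to an error probability and, under the simplifying assumption that every base has the same independent error probability, estimates how likely a k-mer is to be error-free. The function names are illustrative.

```python
def phred_to_error_prob(q):
    """Phred quality score -> probability that the base call is wrong."""
    return 10 ** (-q / 10)


def prob_kmer_error_free(k, q):
    """Chance that all k bases are correct, assuming independent errors at quality q."""
    return (1 - phred_to_error_prob(q)) ** k


if __name__ == "__main__":
    print(phred_to_error_prob(30))        # about one error per 1000 bases
    print(prob_kmer_error_free(31, 30))   # a 31-mer at uniform Q30
    print(prob_kmer_error_free(31, 20))   # the same k-mer at uniform Q20
```

Even at a uniform Q30, a small fraction of k-mers contains at least one error, which is why erroneous paths have to be removed from the graph.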
Repeats are very problematic in genome assembly
  With short reads, no algorithm can resolve repeats exactly.
  (Figure: OLC example with reads 1 to 4 overlapping a repeat.)
Repeats are very problematic in genome assembly
  dbg: reads are immediately split into shorter k-mers, so repeats may not be resolved as well as with an overlap graph.
Typical results of the different algorithms when the sequence contains repeats
  (Figure: string overlap algorithm versus graph-based algorithms.)
  Source: www.langmead-lab.org/teaching-materials
How to select k in dbg algorithms
  Find the optimal balance between sensitivity and graph complexity.
  Guideline for k selection:
    Low coverage: use a smaller k, which increases the number of overlapping reads that contribute to the graph.
    High coverage: use a larger k; there is no need to be as sensitive, and the graph complexity needs to be reduced.
Based on the number of input base pairs, CLC automatically determines the k-mer length: 12 to 24 on 32-bit computers and 12 to 64 on 64-bit computers (maximum 64).
  Resources: http://www.clcsupport.com/clcassemblycell/4 20/index.php?manual=How_it_works.html
Comparison of assembly algorithms: OLC and dbg
         Coverage   Reads   Genome assembly
  OLC    low        long    small
  dbg    high       short   large
Advantages
  OLC: can analyze sequences of varying length from different platforms.
  OLC: assembles directly from overlapping sequences, which gives high reliability.
  dbg: high speed and high efficiency.
Disadvantages
  OLC: very slow and computationally demanding.
  OLC: mainly applicable to long-read sequencing.
  dbg: if a repeat is longer than k, the assembly becomes error-prone.
  dbg: any error in the sequence, regardless of its size, creates a bifurcation in the graph, so a correction step is necessary.
  dbg: the assembled genome sometimes does not match the original reads 100%.
No assembler or algorithm performed consistently well on all statistics.
What is N50?
  1. After sequence assembly, we get a set of contigs.
  2. Sort the contigs by length in descending order and calculate the sum of their lengths.
  3. Add up the contig lengths from the longest downwards. The N50 length is the length N of the contig at which the running total first reaches 50% of the sum of the lengths of all contigs.
  Example from the slide figure: with nine contigs sorted by length, half of the total length is reached within contig #2, so N50 = the length of contig #2.
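The three steps above translate directly into code. In this minimal sketch the contig lengths are made-up numbers used only for illustration, and `n50` is an illustrative function name.

```python
def n50(contig_lengths):
    """Length of the contig at which the running total reaches half of the assembly."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running * 2 >= total:  # reached 50% of the total length
            return length
    return 0  # only reached for an empty list


if __name__ == "__main__":
    lengths = [80, 70, 50, 40, 30, 20]  # hypothetical contig lengths
    print(n50(lengths))  # total 290; 80 + 70 = 150 passes the half-way mark of 145
```

Note that N50 is a contig length, not a count of contigs, and that it depends only on the length distribution, not on whether the contigs are correct.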
The longer the N50 length, the better the assembly quality?
  The N50 of Assembly B >> the N50 of Assembly A. Is the result of Assembly B therefore better?
  (Figure: the 50% length mark in the sorted contigs of Assembly A and Assembly B.)
N25 and N75
  If N50 and N25 are similar, most contigs are long.
  If N50 and N75 are similar, most contigs are of medium to short length.
  (Figure: positions of N25, the 50% length and N75 in the sorted contigs.)
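N25 and N75 are computed like N50, with 25% or 75% of the total length as the threshold. A minimal sketch that generalizes the calculation to any percentage follows; the function name `nx` and the contig lengths are hypothetical.

```python
def nx(contig_lengths, x):
    """Nx: length of the contig at which the running total reaches x% of the assembly."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running * 100 >= total * x:
            return length
    return 0  # only reached for an empty list


if __name__ == "__main__":
    lengths = [80, 70, 50, 40, 30, 20]  # hypothetical contig lengths
    for x in (25, 50, 75):
        print(f"N{x} =", nx(lengths, x))
```

Comparing the three values gives a rough picture of how evenly the assembled length is spread over long and short contigs.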
De novo transcriptome assembly
  (Nature Reviews Genetics, 2011)
Overview of the de novo transcriptome assembly strategy
  Step 1: Generate k-mer sequences from the reads
  (Martin & Wang, Nat. Rev. Genet., 2011)
Overview of the de novo transcriptome assembly strategy
  Step 2: Generate the de Bruijn graph
  Step 3: Simplify the de Bruijn graph
  (Martin & Wang, Nat. Rev. Genet., 2011)
Overview of the de novo transcriptome assembly strategy
  Step 4: Traverse the graph
  Step 5: Assembled isoforms
  (Martin & Wang, Nat. Rev. Genet., 2011)
Contrasting genome and transcriptome assembly
  Genome assembly            Transcriptome assembly
  Uniform coverage           Exponentially distributed coverage levels
  Single contig per locus    Multiple contigs per locus (alternative splicing)
  Double-stranded            Strand-specific
Genome assembly: a single massive graph; entire chromosomes are represented.
Transcriptome assembly: many thousands of small graphs; ideally, one graph per expressed gene.
Trinity (Haas et al., Nat Protoc, 2013)
Trinity: RNA-Seq de novo assembly
  Inchworm assembles the reads into contigs, generating unique full-length transcripts for a dominant isoform.
  Chrysalis clusters the contigs and constructs a complete de Bruijn graph for each cluster.
  Butterfly compacts the graphs using the reads and reports full-length transcripts for alternatively spliced isoforms.
  (Haas et al., Nat Protoc, 2013)
De novo metagenome assembly (MetaVelvet software)
  DNA extraction from microbial community
  -> Sequencing -> mixed sequence reads of multiple species
  -> Assembly -> contigs or scaffolds for metagenomic sequences
  (Sakakibara et al., NAR, 2014)
De novo metagenome assembly
  DNA extraction from microbial community
  -> Sequencing -> mixed sequence reads of multiple species
  -> Assembly -> contigs or scaffolds for metagenomic sequences
  Advantages:
    High-throughput sequencing
    Deep sequencing of low-abundance populations
  Problems:
    Short read length
    Mixture of sequence reads, which leads to chimeric assembly
De novo metagenome assembly
  DNA extraction from microbial community
  -> Sequencing -> mixed sequence reads of multiple species
  -> Assembly -> contigs or scaffolds for metagenomic sequences
  Alternative route: clustering, followed by single genome assembly
MetaVelvet strategy
  Construct a large de Bruijn graph for the mixed reads of multiple species, then decompose it into subgraphs and assemble each one separately: assembly for species A, assembly for species B, assembly for species C.
  (Figure: example graph with nodes such as ATGT, GTC, AACA, CG, GGC and GACCGTA, split into three subgraphs.)
Problem with metagenome assembly using Velvet
  Velvet mislabels nodes when it is applied to a metagenome:
    Nodes of high coverage are mislabeled as repeats.
    Nodes of low coverage are wrongly removed as errors.
  Example: species A with high coverage (assume 60), species B with medium coverage (assume 30), species C with low coverage (assume 10).
Mental preparation: before running de novo assembly, please read this article first.
Or at least read Box 1 of that article, a shortcut to the whole picture.
De novo assembly improvement suggestions
  Good-quality data is key to a successful assembly:
    Trim reads based on quality.
    Trim adapters from the sequences.
  Scan over many k values (25 to 65) and pick the one with the best N50:
    High-quality data: larger k.
    Data with homopolymer errors: smaller k.
  Combining genome and transcriptome assembly can vastly improve assemblies.
  Expect lower quality in difficult regions:
    Repeats
    High GC content
  Bubble size (using CLC):
    If you do not expect a repetitive genome: higher bubble size.
    If your sequence quality is not good: higher bubble size.
    If you anticipate more repeats: smaller bubble size.
"Don't take as gospel the output of an assembly program. If your paper is going to rely on that, it is absolutely essential that you do PCR and other follow-up experiments."
  Benedict Paten, Assistant Research Scientist, University of California, Santa Cruz
Thank you for your attention
  Email: paul@imb.sinica.edu.tw
  Bioinformatics Core @ IMB
  Rm. N107, IMB BSC, No. 128 Academia Road, Section 2, Nankang, Taipei 115, Taiwan, R.O.C.
  TEL: 886 2 2789 9967