
De novo sequence assembly
2015.11.17
徐唯哲 Paul Wei-Che HSU
Assistant Research Specialist, Bioinformatics Service Core, Institute of Molecular Biology, Academia Sinica, Taiwan, R.O.C.

De novo sequence assembly
- Genome assembly
- Transcriptome assembly
- Metagenome assembly

De novo genome assembly
- Unknown genome
- Shotgun sequencing: DNA is sheared into random fragments (reads or tags)
- Assembly

Shortest common superstring (SCS)
Given a collection of strings S, find SCS(S): the shortest string that contains all strings in S as substrings.
Example:
S: BAA AAB BBA ABA ABB BBB AAA BAB
Concatenation (without the requirement of being shortest): BAAAABBBAABAABBBBBAAABAB (length 24)
SCS(S): AAABBBABAA (length 10), obtained by overlapping AAA AAB ABB BBB BBA BAB ABA BAA
(Ben Langmead, http://www.langmead-lab.org/teaching-materials/)
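The definition can be checked on the slide's example with a brute-force search. This is only a sketch for tiny inputs: it tries every ordering of the strings, so it is not how real assemblers work. The function names are illustrative, and it assumes no input string is contained in another.

```python
from itertools import permutations

def overlap_len(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for n in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def shortest_common_superstring(strings):
    """Try every ordering, merge neighbours by their longest overlap, keep the shortest result."""
    best = None
    for order in permutations(strings):
        merged = order[0]
        for nxt in order[1:]:
            merged += nxt[overlap_len(merged, nxt):]
        if best is None or len(merged) < len(best):
            best = merged
    return best

strings = ["BAA", "AAB", "BBA", "ABA", "ABB", "BBB", "AAA", "BAB"]
scs = shortest_common_superstring(strings)
print(scs, len(scs))
```

With 8 strings there are 8! = 40320 orderings, which is still fast; the factorial growth is the reason approximate methods are used on real data.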

Finding overlaps
- Semiglobal alignment
- Exact string matching
- Suffix tree

Semiglobal alignment
- Needleman-Wunsch algorithm (dynamic programming)
- Initialize the first row to 0s
- The answer is the maximum score in the bottom row
- Traceback starts from the maximum score and continues until it falls off the top side
Example: ACTG and CTG
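A minimal sketch of this dynamic-programming scheme follows. The scoring values (match +1, mismatch -1, gap -1) and the function name are assumptions, since the slide does not give them. The first row is all zeros, so the alignment may start anywhere in the longer string at no cost, and the answer is read from the bottom row.

```python
def semiglobal_score(pattern, text, match=1, mismatch=-1, gap=-1):
    """Return (best score, end column) for aligning all of pattern somewhere in text."""
    rows, cols = len(pattern) + 1, len(text) + 1
    score = [[0] * cols for _ in range(rows)]   # first row stays 0: free start in text
    for i in range(1, rows):
        score[i][0] = i * gap                   # skipping pattern characters is penalised
    for i in range(1, rows):
        for j in range(1, cols):
            step = match if pattern[i - 1] == text[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + step,   # match or mismatch
                              score[i - 1][j] + gap,        # gap in text
                              score[i][j - 1] + gap)        # gap in pattern
    bottom = score[rows - 1]
    best_col = max(range(cols), key=lambda j: bottom[j])    # maximum of the bottom row
    return bottom[best_col], best_col

print(semiglobal_score("CTG", "ACTG"))
```

For the slide's example, CTG aligns against the last three characters of ACTG, so the best bottom-row score is found in the final column.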

Exact string matching (L = 3)
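Exact matching can be sketched as follows, assuming L is the minimum overlap length that counts. The idea is to look for the first L characters of the second read inside the first read, then check whether the rest of the first read matches the start of the second. The function name is illustrative.

```python
def overlap(a, b, min_length=3):
    """Length of the longest suffix of a matching a prefix of b, or 0 if shorter than min_length."""
    start = 0
    while True:
        start = a.find(b[:min_length], start)   # candidate position of b's prefix in a
        if start == -1:
            return 0                            # no candidate left
        if b.startswith(a[start:]):             # suffix of a is a prefix of b
            return len(a) - start
        start += 1                              # try the next candidate

print(overlap("GACATA", "ATAGAC"))
print(overlap("ATAGAC", "GACATA"))
```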

Suffix tree
Generalized suffix tree for GACATA and ATAGAC, built from the concatenated string GACATA$0ATAGAC$1 (positions 0 to 13).
[Figure: generalized suffix tree of GACATA$0ATAGAC$1]

String overlap algorithm: greedy-extension algorithm
1. Finding overlaps: identify the overlapping regions (select the highest score)
2. Merging: merge the overlapping sequences
3. Identify the overlapping regions again, then merge (rerun)
4. Repeat until the sequences cannot be merged anymore

Greedy-extension algorithm (string-based assemblers)
- SSAKE (2007), SHARCGS (2007), and QSRA (2009) are applicable to the Illumina platform
- More time-consuming; suitable for small numbers of reads (low throughput) and smaller genomes
- The greedy algorithm is not guaranteed to choose the overlaps that yield the SCS, but it is a good approximation.

Shortest common superstring: using the greedy-extension algorithm
Greedy-SCS algorithm in action
Rounds of merging, one merge per line. The number in the first column is the length of the overlap merged before that round.
Input strings: ABA ABB AAA AAB BBB BBA BAB BAA
2  BAAB ABA ABB AAA BBB BBA BAB
2  BABB BAAB ABA AAA BBB BBA
2  BBAAB BABB ABA AAA BBB
2  BBBAAB BABB ABA AAA
2  BBBAABA BABB AAA
2  BABBBAABA AAA
1  BABBBAABAAA
Superstring (greedy answer): BABBBAABAAA
Actual SCS: AAABBBABAA
(Ben Langmead, http://www.langmead-lab.org/teaching-materials/)
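The merging rounds can be sketched in a few lines. This is a simplified illustration with hypothetical function names. Ties between equally long overlaps are broken by iteration order here, so the superstring it prints may differ from the one on the slide while still containing every input string.

```python
from itertools import permutations

def overlap(a, b, min_length=1):
    """Longest suffix of a that is a prefix of b, at least min_length long, else 0."""
    for n in range(min(len(a), len(b)), min_length - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def pick_max_overlap(strings, min_length):
    """Return the ordered pair with the longest overlap and that overlap length."""
    best = (None, None, 0)
    for a, b in permutations(strings, 2):
        n = overlap(a, b, min_length)
        if n > best[2]:
            best = (a, b, n)
    return best

def greedy_scs(strings, min_length=1):
    """Merge the best-overlapping pair repeatedly until nothing overlaps."""
    strings = list(strings)
    a, b, n = pick_max_overlap(strings, min_length)
    while n > 0:
        strings.remove(a)
        strings.remove(b)
        strings.append(a + b[n:])               # merged string replaces the pair
        a, b, n = pick_max_overlap(strings, min_length)
    return "".join(strings)                     # leftovers are simply concatenated

reads = ["ABA", "ABB", "AAA", "AAB", "BBB", "BBA", "BAB", "BAA"]
print(greedy_scs(reads))
```

Each round scans all ordered pairs, which is part of why this approach suits small read sets.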

Graph-based assemblers
High speed; suitable for large numbers of reads (high throughput) and bigger genomes
- Overlap-layout-consensus (OLC): Newbler (2006, 454 platform), Forge (2009, 454 + Illumina)
- de Bruijn graph assembly (dbg): Velvet (2008), CLCbio (2009), ABySS (2009), and SOAPdenovo (2010) are applicable to the Illumina platform

Overlap-layout-consensus (OLC)
Software: Newbler (454 platform), SGA
1. Finding overlaps
2. Build overlap graph
3. Layout: bundle stretches of the overlap graph into contigs
4. Consensus: pick the most likely nucleotide sequence for each contig

Finding overlaps
- Semiglobal alignment
- Exact string matching
- Suffix tree

Build overlap graph
Find the overlap relationships between all reads, then draw them as a graph in which the reads are the nodes and the overlapping sequences are the edges.
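As a rough sketch, an overlap graph can be stored as a dictionary that maps each ordered pair of reads to the length of their overlap. The minimum overlap length of 3 and the function names are assumptions for illustration. Comparing every pair of reads, as done here, is the naive approach.

```python
from itertools import permutations

def overlap(a, b, min_length):
    """Longest suffix of a that is a prefix of b, at least min_length long, else 0."""
    for n in range(min(len(a), len(b)), min_length - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def overlap_graph(reads, min_length=3):
    """Directed graph: edge (a, b) is labelled with the suffix-prefix overlap length."""
    graph = {}
    for a, b in permutations(reads, 2):
        n = overlap(a, b, min_length)
        if n > 0:
            graph[(a, b)] = n
    return graph

for (a, b), n in overlap_graph(["GACATA", "ATAGAC"]).items():
    print(a, "->", b, "overlap", n)
```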

Layout

Layout: Hamiltonian path
A Hamiltonian path is a path between two vertices of a graph that visits each vertex exactly once. If there is also an edge from the last vertex of the path back to the first, the resulting closed path is called a Hamiltonian circuit.
[Figure: example graph with vertices A to I]
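A small backtracking search shows what finding a Hamiltonian path involves. This is a sketch for toy graphs only: the search can take exponential time, which is one reason the layout step is hard in practice. The graph representation (a dictionary from each node to its successors) is an assumption.

```python
def hamiltonian_path(graph):
    """Return a list of nodes visiting every node exactly once, or None if none exists."""
    nodes = list(graph)

    def extend(path, visited):
        if len(path) == len(nodes):
            return path                          # every vertex has been visited once
        for nxt in graph[path[-1]]:
            if nxt not in visited:
                found = extend(path + [nxt], visited | {nxt})
                if found:
                    return found
        return None                              # dead end: backtrack

    for start in nodes:
        found = extend([start], {start})
        if found:
            return found
    return None

toy = {"A": ["B", "C"], "B": ["C"], "C": ["D"], "D": []}
print(hamiltonian_path(toy))
```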

Layout Genome: to_every_thing_turn_turn_turn_there_is_a_season (Ben Langmead, http://www.langmead-lab.org/teaching-materials/)

Consensus
Pick the most likely nucleotide sequence for each contig. Where the reads disagree, ask: Deletion? Sequencing error? SNP? Insertion?
(Ben Langmead, http://www.langmead-lab.org/teaching-materials/)
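One simple way to pick the most likely base is a per-column majority vote over reads that have already been laid out against each other. The sketch below assumes equal-length rows in which "-" marks positions a read does not cover. Real consensus callers also weigh base qualities and handle insertions and deletions, which this example does not attempt.

```python
from collections import Counter

def consensus(aligned_reads):
    """Majority vote per column; '-' means the read does not cover that column."""
    result = []
    for column in zip(*aligned_reads):
        votes = Counter(base for base in column if base != "-")
        if votes:
            result.append(votes.most_common(1)[0][0])
    return "".join(result)

layout = [
    "ACGTAC--",
    "-CGTTCGA",   # the T in column 5 disagrees with the other two reads
    "--GTACGA",
]
print(consensus(layout))
```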

Limitation of OLC
OLC cannot handle more than about a million reads effectively.

A more efficient way?
Indexing instead of one-to-one comparison

Use k-mer sequences instead of reads
- True genome (you never know it)
- Reads
- k-mer sequences: break the reads into smaller k-mer sequences
De Bruijn graph assembly (DBG)

de Bruijn graph assembly (dbg)
Velvet (2008), CLCbio (2009), ABySS (2009), SOAPdenovo (2010)
Step 1: each read is replaced by all of its substrings of length k (k-mers).
Example (k = 3): the read AGATGATTCG has the 3-mers AGA GAT ATG TGA GAT ATT TTC TCG
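Step 1 amounts to sliding a window of width k along the read. A minimal sketch, with an illustrative function name:

```python
def kmers(read, k):
    """All substrings of length k, in order of appearance (repeats are kept)."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]

print(kmers("AGATGATTCG", 3))
```

A read of length n yields n - k + 1 k-mers, so the 10-base example read gives 8 3-mers, with GAT appearing twice.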

de Bruijn graph assembly (dbg)
Step 2: draw the graph with (k-1)-mers as vertices and k-mers as edges (each distinct (k-1)-mer appears only once in the graph).
Read: AGATGATTCG
k-mers: AGA, GAT, ATG, TGA, GAT, ATT, TTC, TCG
Each k-mer gives one edge from its left (k-1)-mer to its right (k-1)-mer: AGA: AG GA; GAT: GA AT; ATG: AT TG; TGA: TG GA; GAT: GA AT; ATT: AT TT; TTC: TT TC; TCG: TC CG
Vertices: AG GA AT TG TT TC CG
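The graph construction can be sketched with an adjacency list: each k-mer adds one edge from its first k-1 characters to its last k-1 characters. The function name and the dictionary representation are choices made for this illustration, not those of any particular assembler.

```python
from collections import defaultdict

def de_bruijn_graph(kmers):
    """Map each (k-1)-mer to the list of (k-1)-mers it points to, one entry per k-mer."""
    graph = defaultdict(list)
    for kmer in kmers:
        graph[kmer[:-1]].append(kmer[1:])        # edge: prefix -> suffix
    return dict(graph)

kmer_list = ["AGA", "GAT", "ATG", "TGA", "GAT", "ATT", "TTC", "TCG"]
for node, targets in de_bruijn_graph(kmer_list).items():
    print(node, "->", targets)
```

The repeated k-mer GAT shows up as two parallel edges from GA to AT, while the vertex GA itself exists only once.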

de Bruijn graph assembly (dbg)
Step 3: find an Eulerian path, that is, a walk through the directed graph that traverses each edge exactly once.
Read: AGATGATTCG
k-mers (edges): AGA GAT ATG TGA GAT ATT TTC TCG
Walk: AG GA AT TG GA AT TT TC CG
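One standard way to find such a walk is Hierholzer's algorithm, sketched below. It assumes that an Eulerian path exists in the graph built from the given k-mers and does not check for it. The function name is hypothetical.

```python
from collections import defaultdict

def assemble_from_kmers(kmers):
    """Build the de Bruijn graph, walk every edge once, and spell out the sequence."""
    out_edges = defaultdict(list)
    in_degree = defaultdict(int)
    for kmer in kmers:
        out_edges[kmer[:-1]].append(kmer[1:])
        in_degree[kmer[1:]] += 1
    nodes = list(out_edges)
    # Start where out-degree exceeds in-degree by one; otherwise start anywhere.
    start = next((n for n in nodes if len(out_edges[n]) - in_degree[n] == 1), nodes[0])
    stack, path = [start], []
    while stack:
        node = stack[-1]
        if out_edges[node]:
            stack.append(out_edges[node].pop())  # follow an unused edge
        else:
            path.append(stack.pop())             # no edges left: node is finished
    path.reverse()
    return path[0] + "".join(node[-1] for node in path[1:])

print(assemble_from_kmers(["AGA", "GAT", "ATG", "TGA", "GAT", "ATT", "TTC", "TCG"]))
```

For the example k-mers the walk is unique, so the original read is recovered. With longer repeats several walks can exist, and the graph alone does not say which one is the true genome.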

Once the reads are broken into k-mer sequences, it is more efficient to assemble with a dbg than with OLC (Compeau et al., 2011, Nat. Biotechnol.)

Error correction
In order to assemble fewer and longer contigs, most assembly programs apply error correction that modifies the result.


dbg algorithm (Velvet software)
Step 1: Sequencing. The genome is sequenced into reads of length 7 (red stands for a sequencing error).
Step 2: Build a lookup table of k-mers (k = 4) and link all k-mers.

dbg algorithm (Velvet software)
Step 3: Simplify the graph by combining overlapping k-mers into longer sequences. Note that simplifying the graph can leave several possible paths.
Step 4: Remove the error path to get four contigs.

Required conditions for a perfect dbg
1. The k-mers cover the entire genome.
   This is unlikely: some regions of the genome are hard to sequence (GC-rich regions or structural problems) and others are very easy, so some regions are covered by many reads while others have none.
2. All k-mer sequences are free of errors.
   This is impossible: even Illumina, the highest-quality platform so far, only guarantees about 80% of bases at Q30 (one error per 1000 bases).
3. Each k-mer appears only once in the genome.
   This is impossible: most genomes, including viral genomes, contain repeated sequences of varying lengths. About 45% of the human genome consists of repeated sequences.
Reference: Human Molecular Genetics, 4/e, 2010
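The relation between a Phred quality score Q and the error probability p is p = 10^(-Q/10), which is where "one error per 1000 bases" for Q30 comes from. A short sketch, with an illustrative function name:

```python
def phred_to_error_probability(q):
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)

for q in (10, 20, 30, 40):
    print(f"Q{q}: about 1 error in {round(1 / phred_to_error_probability(q))} bases")
```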

Repeats are very problematic in genome assembly
With short reads, no algorithm can resolve repeats exactly.
[Figure: OLC example with read1, read2, read3, and read4]

Repeats are very problematic in genome assembly
dbg: reads are immediately split into shorter k-mers, so a dbg may not resolve repeats as well as an overlap graph.

Typical results of different algorithms when the sequences contain repeats
- String overlap algorithm
- Graph-based algorithms
Source: www.langmead-lab.org/teaching-materials

How to select k in dbg algorithms
Find the optimal balance between sensitivity and graph complexity.
Guideline for k selection:
- Low coverage: use a smaller k-mer, which increases the number of overlapping reads that contribute to the graph.
- High coverage: use a larger k-mer; there is no need to be too sensitive, and graph complexity needs to be reduced.

Based on the number of base pairs in the input, CLC automatically determines the length of the k-mer: 12-24 on 32-bit computers and 12-64 on 64-bit computers (maximum 64).
Source: http://www.clcsupport.com/clcassemblycell/4 20/index.php?manual=How_it_works.html

Comparison of assembly algorithms: OLC and dbg
- OLC: low coverage, long reads, small genome assembly
- dbg: high coverage, short reads, large genome assembly

Merits
OLC:
- It can analyze sequences of varying length from different platforms.
- It assembles from overlapping sequences, so reliability is high.
dbg:
- High speed, high efficiency.
Drawbacks
OLC:
- Very low speed; computationally difficult.
- It is applicable to long-read sequencing.
dbg:
- If a repeat is longer than the k-mer, the assembly is error-prone.
- Any error in a read, regardless of its size, causes the graph to bifurcate, so correction is necessary.
- The assembled genome sometimes does not match the original reads 100%.
No assembler or algorithm performed consistently well across all the statistics.

What is N50?
1. After sequence assembly, we get a set of contigs.
2. Sort the contigs by length in descending order and calculate the sum of their lengths.
3. Add up the contig lengths starting from the longest. The N50 is the length of the contig at which this running sum first reaches 50% of the total length of all contigs.
Example: with contigs 1 to 9 sorted by length, half of the total length is reached within contig #2, so N50 = the length of contig #2.
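The three steps translate directly into code. The sketch below generalizes N50 to any percentage, so the same function also gives N25 and N75. The function name and the example contig lengths are made up for illustration.

```python
def nx(lengths, x=50):
    """Length of the contig at which the running sum (longest first) reaches x% of the total."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 100 >= total * x:          # integer arithmetic avoids rounding issues
            return length
    raise ValueError("lengths must not be empty")

contigs = [80, 70, 50, 40, 30, 20]               # hypothetical contig lengths, total 290
print("N25:", nx(contigs, 25))
print("N50:", nx(contigs, 50))
print("N75:", nx(contigs, 75))
```

Here half of the total is 145, and the two longest contigs together reach 150, so the N50 is the length of the second contig.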

Does a longer N50 mean a better assembly?
The N50 of Assembly B is much larger than the N50 of Assembly A. Does that mean the result of Assembly B is better?

N25 and N75
If the N50 and the N25 are similar, most contigs are long.
If the N50 and the N75 are similar, most contigs are of medium to short length.

De novo transcriptome assembly
Nature Reviews Genetics, 2011

Overview of the de novo transcriptome assembly strategy Step1: Generate k-mer sequences from the reads (Martin & Wang, Nat. Rev. Genet., 2011)

Overview of the de novo transcriptome assembly strategy Step2: Generate the de Bruijn graph Step3: Simplify the de Bruijn graph (Martin & Wang, Nat. Rev. Genet., 2011)

Overview of the de novo transcriptome assembly strategy Step4: Traverse the graph Step5: Assembled isoforms (Martin & Wang, Nat. Rev. Genet., 2011)

Contrasting genome and transcriptome assembly
Genome assembly:
- Uniform coverage
- Single contig per locus
- Double-stranded
Transcriptome assembly:
- Exponentially distributed coverage levels
- Multiple contigs per locus (alternative splicing)
- Strand-specific

Genome assembly: a single massive graph. Entire chromosomes are represented.
Transcriptome assembly: many thousands of small graphs. Ideally, one graph per expressed gene.

Trinity (Haas et al., Nat Protoc, 2013)

Trinity: RNA-Seq de novo assembly
RNA-Seq reads -> linear contigs -> de Bruijn graphs -> transcripts + isoforms
(Haas et al., Nat Protoc, 2013)

Inchworm
Step 1: Decompose all reads into k-mers (k = 25).
Step 2: Identify the seed k-mer as the most abundant k-mer, ignoring low-complexity k-mers.
Step 3: Extend the k-mer at the 3'-end, guided by coverage.
Step 4: Remove the assembled k-mers from the catalog, then repeat the entire process.
[Figure: extension example in which each candidate base (A, C, G, T) is labeled with its k-mer count; reported contig: AAGATTACAGA]
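The four steps can be illustrated with a much-simplified greedy extension. This is not Trinity's implementation. It is a sketch that assumes a small k, skips the low-complexity filter mentioned in Step 2, and extends only in the 3' direction. The function name and the toy reads are hypothetical.

```python
from collections import Counter

def inchworm_like_contigs(reads, k):
    """Greedy k-mer extension: seed with the most abundant k-mer, extend by coverage."""
    # Step 1: catalog of k-mer counts
    counts = Counter(read[i:i + k] for read in reads for i in range(len(read) - k + 1))
    contigs = []
    while counts:
        # Step 2: seed with the most abundant remaining k-mer
        contig, _ = counts.most_common(1)[0]
        del counts[contig]
        # Step 3: extend at the 3' end with the best-covered next k-mer
        while True:
            suffix = contig[-(k - 1):]
            candidates = [(counts[suffix + base], base) for base in "ACGT"
                          if counts[suffix + base] > 0]
            if not candidates:
                break
            _, base = max(candidates)
            del counts[suffix + base]            # Step 4: remove used k-mers
            contig += base
        contigs.append(contig)
    return contigs

toy_reads = ["AAGATTACAGA", "AAGAT"]             # the second read makes AAGAT the seed
print(inchworm_like_contigs(toy_reads, k=5))
```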

Chrysalis
Chrysalis pools Inchworm contigs and links linear sequences that overlap by k-1 bases to build graph components, integrating isoforms via k-1 overlaps.
(Haas et al., Nat Protoc, 2013)

Butterfly
Build dbg graphs and compact them. Ideally, one graph per gene.

De novo metagenome assembly (MetaVelvet software)
- DNA extraction from microbial community
- Sequencing: mixed sequence reads of multiple species
- Assembly: contigs or scaffolds for metagenomic sequences
(Sakakibara et al., NAR, 2014)

De novo metagenome assembly
- DNA extraction from microbial community
- Mixed sequence reads of multiple species
- Contigs or scaffolds for metagenomic sequences
Processing steps shown: sequencing, assembly, clustering, single genome assembly
(Sakakibara et al., NAR, 2014)

Construct a large de Bruijn graph for the mixed reads of multiple species, then decompose it into subgraphs:
- Assembly for species A
- Assembly for species B
- Assembly for species C
[Figure: mixed de Bruijn graph split into one subgraph per species]

Velvet vs. MetaVelvet: de Bruijn graph of a metagenome assembly
- Low coverage (assume = 10): Species A (MetaVelvet); mis-removed as error (Velvet)
- Mid coverage (assume = 30): Species B (MetaVelvet)
- High coverage (assume = 60): Species C (MetaVelvet); mislabeled as repeat (Velvet)

Mental preparation, in case you are out of touch with reality: before running de novo assembly, please read this article first.

Otherwise, at least look at Box 1 of this article: a shortcut to the whole picture.

De novo assembly improvement suggestions
- Good-quality data is key to a successful assembly:
  - Trim based on quality
  - Trim adapters from sequences
- Scan over many k values (25-65) and pick the one with the best N50
  - High-quality data -> larger k-mer
  - Data with homopolymer errors -> smaller k-mer
- Genome + transcriptome assembly can vastly improve assemblies
- Expect lower quality in difficult regions:
  - Repeats
  - High GC content
- Bubble size (using CLC):
  - If you do not expect a repetitive genome -> higher bubble size
  - If your sequence quality is not good -> higher bubble size
  - If you anticipate more repeats -> smaller bubble size
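As an illustration of quality-based trimming, the sketch below removes low-quality bases from the 3' end of a read. It assumes Phred+33 quality encoding and a threshold of Q20; both are assumptions, and dedicated trimming tools use more refined rules (sliding windows, adapter matching). The function name is hypothetical.

```python
def trim_3prime(seq, qual, min_q=20, offset=33):
    """Drop bases from the 3' end while their Phred quality is below min_q."""
    end = len(seq)
    while end > 0 and ord(qual[end - 1]) - offset < min_q:
        end -= 1
    return seq[:end], qual[:end]

# 'I' encodes Q40 and '#' encodes Q2 under Phred+33
print(trim_3prime("ACGTACGT", "IIIIII##"))
```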

Bubble size (using CLC)
Increasing the bubble size also increases the chance of misassemblies.
(CLCbio Manual)

"Don't take as gospel the output of an assembly program. If your paper is going to rely on that, it is absolutely essential that you do PCR and other follow-up experiments."
Benedict Paten, Assistant Research Scientist, University of California, Santa Cruz

Thank you for your attention~
My Email: paul@imb.sinica.edu.tw
Rm. N107, IMB BSC, No. 128 Academia Road, Section 2, Nankang, Taipei 115, Taiwan, R.O.C.
Bioinformatics Core @ IMB
TEL: 886-2-2789-9967