de novo metagenome assembly
|
|
- Gerard Malone
- 5 years ago
- Views:
Transcription
1 1 de novo metagenome assembly Rayan Chikhi CNRS Univ. Lille 1 Formation metagenomique
2 de novo metagenomics 2
3 de novo metagenomics Goal: biological sense out of sequencing data Techniques: 1. de novo assembly 2. contigs binning 3. taxonomic assignment or 1. direct comparison of n datasets (e.g. similarity matrix) 2. extract clusters 3
4 Sequencing data 4 Illumina, short reads. Preferable but harder-to-get data: - PacBio, Nanopore - 10X - Hi-C
5 Various levels of genetic diversity [Konstantinidis 06] 5
6 Various levels of metagenome complexity ftp://ftp.imicrobe.us/abe487/lectures/mln_wk5_assembly.pdf 6
7 When to assemble 7
8 Assembly difficulties 8 1. closely related strains 2. uneven depths, & low depths 3. inter-species repeats 4. large size of datasets 5. lack of long reads 6. scalability of tools (adapted from A. Korobeynikov)
9 Outline of metagenomic assemblers 9 - Construct a de Bruijn graph - Remove sequencing errors and small variants from the graph - Output unambiguous paths of the graph (contigs). - Use paired-end information to join contigs (scaffolds)
10 10 de Bruijn graphs A de Bruijn graph for a fixed integer k: 1. Nodes = all k-mers (substrings of length k) in the reads. 2. There is an edge between x and y if the (k 1)-mer prefix of y matches exactly the (k 1)-mer suffix of x. AGCCTGA AGCATGA dbg, k = 3: GCC CCT CTG AGC TGA GCA CAT ATG
11 de Bruijn graphs 11 A de Bruijn graph for a fixed integer k: 1. Nodes = all k-mers (substrings of length k) in the reads. 2. There is an edge between x and y if the (k 1)-mer prefix of y matches exactly the (k 1)-mer suffix of x. ACTG CTGC TGCT GCTG CTGA TGAT dbg, k = 3: ACT CTG TGC GAT TGA GCT
12 Compacted de Bruijn graph 12 Compacted de Bruijn graph: TCATTG TGGTAA TGCGAA AACCG Each non-branching path becomes a single node (unitig). - no loss of information - less space
13 Data is messy 13 dbg of S. aureus, uncleaned (SRR022865) it s a SMALL genome
14 Detailed steps of metagenomic assemblers TB reads.gz k-mer counting Recent progress, Stand-alone software 700 GB k-mers graph compaction High-memory or slow 30 GB unitigs graph cleaning Heuristics 20 Gbp spruce [Birol 2013]... - computationally intensive - bottlenecks at early stages
15 Multi-k 15 Input reads Assembler k=21 Assembler k=55 Assembler k=77 Final assembly In principle, better than single-k assembly. Notable assemblers that implement multi-k: - IDBA, metaspades, Megahit Notable assemblers that don t: - Velvet, SOAPdenovo, Trinity, ABySS
16 Metagenomic assemblers 16 metaspades, MEGAHIT, Minia, Ray Meta, MetaVelvet, metamos, IDBA-UD, Meta-IDBA, Omega, Genovo, InteMAP, MIRA, CLC, SOAPdenovo2 (sort of)...
17 Assemblers parameters 17 - k value(s) - coverage(s) cut-off or - no parameter
18 Effect of k-mer size 18 Salmonella genome, cleaned assembly (Velvet), 100 bp Illumina reads. k = 51
19 Effect of k-mer size 18 Salmonella genome, cleaned assembly (Velvet), 100 bp Illumina reads. k = 61
20 Effect of k-mer size Salmonella genome, cleaned assembly (Velvet), 100 bp Illumina reads. k =
21 Effect of k-mer size Salmonella genome, cleaned assembly (Velvet), 100 bp Illumina reads. k =
22 Effect of k-mer size Salmonella genome, cleaned assembly (Velvet), 100 bp Illumina reads. k =
23 Visualization of multi-k graphs Salmonella genome, cleaned assembly (SPAdes), MiSeq reads. k = 21 19
24 Visualization of multi-k graphs 19 Salmonella genome, cleaned assembly (SPAdes), MiSeq reads. k = 55
25 Visualization of multi-k graphs 19 Salmonella genome, cleaned assembly (SPAdes), MiSeq reads. k = 99
26 Large metagenome graph Source: Bandage wiki 20
27 Large metagenome graph (zoom) Source: Bandage wiki 21
28 Bandage sequence search Searching a large transposon Source: Bandage wiki 22
29 23 Graph formats - FASTG - GFA - GFA2 H VN:Z:1.0 S 11 ACCTT S 12 TCAAGG S 13 CTTGATT L M L M L M P ,12-,13+ 4M,5M ++ ACCTT CTTGATT TCAAGG
30 Graph simplifications (here, SPAdes-inspired) 24 Tip removal: len tip 3.5k or len tip 10k 2cov tip cov neighbors Bulge removal: len bulge max(3k, 100) cov bulge 1.1cov altpath len altpath = len bulge ± delta delta = max(0.1len bulge, 3) Erroneous connection removal: len EC 10k 4cov EC cov neighbors
31 the Minia pipeline 25 Input reads.fq.gz Bloocoo error-correction.fq.gz or, direct assembly of raw reads DSK 3 k-mer counting.h5 BCALM 2.1 unitigs assembly.fa/.gfa Minia 3 contigs assembly.fa/.gfa multi-k contigs assembly Contigs.fa BESST 2 scaffolding.fa Scaffolds.fa
32 the SPAdes assembler SPAdes: multi-k de Bruijn graph assembler, no graph coverage cut-off, graph simplifications. exspander: universal module for repeat resolution and scaffolding metaspades 1 : - performance enhancements - coverage-aware utilization of paired reads in repeat resolution - some tuning in graph simplifications 1 from A. Korobeynikov slides 26
33 metaspades figure from A. Korobeynikov 27
34 the MEGAHIT assembler 28 MEGAHIT: HKU metagenomic assembler. Steps: 1. memory-intensive k-mer counting 2. succinct de Bruijn graph (SDBG) construction 3. mercy k-mers 4. graph simplifications (now using IDBA s) 5. scaffolding 6. multi-k iterations
35 A recent benchmark 29
36 CAMI metagenomics assembly a = all genomes b = genomes with ANI >= 95% c = genomes with ANI < 95% 30
37
38 CAMI metagenomics assembly 32 Conclusions - Assemblers using multiple k-mers (Minia, MEGAHIT and Meraga) substantially outperformed single k-mer assemblers - metagenome assemblers produced high-quality results for genomes without close relatives, while only a small fraction of the common strain genomes was assembled
39 Other metrics 33 Same as regular assembly: N50, NG50 Total size % of reads mapping correctly back to the assembly Number of predicted genes % of contigs matching some known reference Further processing (binning, taxonomic profiling)
40 N50 = Largest contig length at which that contig and longer contigs cover 50% of the total assembly length N50 NG50 = Largest contig length at which that contig and longer contigs cover 50% of the total genome length A practical way to compute N50: - Sort contigs by decreasing lengths - Take the first contig (the largest): does it cover 50% of the assembly? - If yes, its length is the N50 value. - Else, consider the two largest contigs, do they cover 50%? - If yes, then the N50 is the length of the second largest contig. - And so on.. 34
41 35 Simka [Benoit et al, 2016] Full HMP data: n=690 samples, (32 billion reads) 0.5 day computation k-mer composition Abundance table A B N ACGAG CGAGC n sequencing datasets GAGCT >read0 ACACGTCAACGGCTACGGCAGA CTAACGACTCAGTC >read1 ACGAGCATCAGGGGACGTATGT TATGCAGTTCCAGG >read0 ACACGTCAACGGCTACGGCAGA CTAACGACTCAGTC >read1 ACGAGCATCAGGGGACGTATGT TATGCAGTTCCAGG >read0 ACACGTCAACGGCTACGGCAGA CTAACGACTCAGTC >read1 ACGAGCATCAGGGGACGTATGT TATGCAGTTCCAGG Feature extraction - (k-mer counting) Jaccard, Bray- Curtis,.. similarity matrix
42 When can we assemble 36
43 Digital normalization 37 Method using k-mers to normalize in silico the coverage of organisms. Enables to: - reduce the size of the dataset - more efficient assembly - (potentially) facilitate assembly of high-coverage material Potential drawbacks: more fragmented assembly, collapses heterozygosity
44 Conclusion 38 - de novo metagenomics is biologically and computationally challenging - Assemblers: metaspades (high quality), MEGAHIT, Minia - Alignment-free and assembly-free analysis: Simka References: CAMI benchmark article, metaspades article Lab: run metaspades, (break), evaluate results