de novo metagenome assembly

Size: px
Start display at page:

Download "de novo metagenome assembly"

Transcription

1 1 de novo metagenome assembly Rayan Chikhi CNRS Univ. Lille 1 Formation metagenomique

2 de novo metagenomics 2

3 de novo metagenomics Goal: biological sense out of sequencing data Techniques: 1. de novo assembly 2. contigs binning 3. taxonomic assignment or 1. direct comparison of n datasets (e.g. similarity matrix) 2. extract clusters 3

4 Sequencing data 4 Illumina, short reads. Preferable but harder-to-get data: - PacBio, Nanopore - 10X - Hi-C

5 Various levels of genetic diversity [Konstantinidis 06] 5

6 Various levels of metagenome complexity ftp://ftp.imicrobe.us/abe487/lectures/mln_wk5_assembly.pdf 6

7 When to assemble 7

8 Assembly difficulties 8 1. closely related strains 2. uneven depths, & low depths 3. inter-species repeats 4. large size of datasets 5. lack of long reads 6. scalability of tools (adapted from A. Korobeynikov)

9 Outline of metagenomic assemblers 9 - Construct a de Bruijn graph - Remove sequencing errors and small variants from the graph - Output unambiguous paths of the graph (contigs). - Use paired-end information to join contigs (scaffolds)

10 10 de Bruijn graphs A de Bruijn graph for a fixed integer k: 1. Nodes = all k-mers (substrings of length k) in the reads. 2. There is an edge between x and y if the (k 1)-mer prefix of y matches exactly the (k 1)-mer suffix of x. AGCCTGA AGCATGA dbg, k = 3: GCC CCT CTG AGC TGA GCA CAT ATG

11 de Bruijn graphs 11 A de Bruijn graph for a fixed integer k: 1. Nodes = all k-mers (substrings of length k) in the reads. 2. There is an edge between x and y if the (k 1)-mer prefix of y matches exactly the (k 1)-mer suffix of x. ACTG CTGC TGCT GCTG CTGA TGAT dbg, k = 3: ACT CTG TGC GAT TGA GCT

12 Compacted de Bruijn graph 12 Compacted de Bruijn graph: TCATTG TGGTAA TGCGAA AACCG Each non-branching path becomes a single node (unitig). - no loss of information - less space

13 Data is messy 13 dbg of S. aureus, uncleaned (SRR022865) it s a SMALL genome

14 Detailed steps of metagenomic assemblers TB reads.gz k-mer counting Recent progress, Stand-alone software 700 GB k-mers graph compaction High-memory or slow 30 GB unitigs graph cleaning Heuristics 20 Gbp spruce [Birol 2013]... - computationally intensive - bottlenecks at early stages

15 Multi-k 15 Input reads Assembler k=21 Assembler k=55 Assembler k=77 Final assembly In principle, better than single-k assembly. Notable assemblers that implement multi-k: - IDBA, metaspades, Megahit Notable assemblers that don t: - Velvet, SOAPdenovo, Trinity, ABySS

16 Metagenomic assemblers 16 metaspades, MEGAHIT, Minia, Ray Meta, MetaVelvet, metamos, IDBA-UD, Meta-IDBA, Omega, Genovo, InteMAP, MIRA, CLC, SOAPdenovo2 (sort of)...

17 Assemblers parameters 17 - k value(s) - coverage(s) cut-off or - no parameter

18 Effect of k-mer size 18 Salmonella genome, cleaned assembly (Velvet), 100 bp Illumina reads. k = 51

19 Effect of k-mer size 18 Salmonella genome, cleaned assembly (Velvet), 100 bp Illumina reads. k = 61

20 Effect of k-mer size Salmonella genome, cleaned assembly (Velvet), 100 bp Illumina reads. k =

21 Effect of k-mer size Salmonella genome, cleaned assembly (Velvet), 100 bp Illumina reads. k =

22 Effect of k-mer size Salmonella genome, cleaned assembly (Velvet), 100 bp Illumina reads. k =

23 Visualization of multi-k graphs Salmonella genome, cleaned assembly (SPAdes), MiSeq reads. k = 21 19

24 Visualization of multi-k graphs 19 Salmonella genome, cleaned assembly (SPAdes), MiSeq reads. k = 55

25 Visualization of multi-k graphs 19 Salmonella genome, cleaned assembly (SPAdes), MiSeq reads. k = 99

26 Large metagenome graph Source: Bandage wiki 20

27 Large metagenome graph (zoom) Source: Bandage wiki 21

28 Bandage sequence search Searching a large transposon Source: Bandage wiki 22

29 23 Graph formats - FASTG - GFA - GFA2 H VN:Z:1.0 S 11 ACCTT S 12 TCAAGG S 13 CTTGATT L M L M L M P ,12-,13+ 4M,5M ++ ACCTT CTTGATT TCAAGG

30 Graph simplifications (here, SPAdes-inspired) 24 Tip removal: len tip 3.5k or len tip 10k 2cov tip cov neighbors Bulge removal: len bulge max(3k, 100) cov bulge 1.1cov altpath len altpath = len bulge ± delta delta = max(0.1len bulge, 3) Erroneous connection removal: len EC 10k 4cov EC cov neighbors

31 the Minia pipeline 25 Input reads.fq.gz Bloocoo error-correction.fq.gz or, direct assembly of raw reads DSK 3 k-mer counting.h5 BCALM 2.1 unitigs assembly.fa/.gfa Minia 3 contigs assembly.fa/.gfa multi-k contigs assembly Contigs.fa BESST 2 scaffolding.fa Scaffolds.fa

32 the SPAdes assembler SPAdes: multi-k de Bruijn graph assembler, no graph coverage cut-off, graph simplifications. exspander: universal module for repeat resolution and scaffolding metaspades 1 : - performance enhancements - coverage-aware utilization of paired reads in repeat resolution - some tuning in graph simplifications 1 from A. Korobeynikov slides 26

33 metaspades figure from A. Korobeynikov 27

34 the MEGAHIT assembler 28 MEGAHIT: HKU metagenomic assembler. Steps: 1. memory-intensive k-mer counting 2. succinct de Bruijn graph (SDBG) construction 3. mercy k-mers 4. graph simplifications (now using IDBA s) 5. scaffolding 6. multi-k iterations

35 A recent benchmark 29

36 CAMI metagenomics assembly a = all genomes b = genomes with ANI >= 95% c = genomes with ANI < 95% 30

37

38 CAMI metagenomics assembly 32 Conclusions - Assemblers using multiple k-mers (Minia, MEGAHIT and Meraga) substantially outperformed single k-mer assemblers - metagenome assemblers produced high-quality results for genomes without close relatives, while only a small fraction of the common strain genomes was assembled

39 Other metrics 33 Same as regular assembly: N50, NG50 Total size % of reads mapping correctly back to the assembly Number of predicted genes % of contigs matching some known reference Further processing (binning, taxonomic profiling)

40 N50 = Largest contig length at which that contig and longer contigs cover 50% of the total assembly length N50 NG50 = Largest contig length at which that contig and longer contigs cover 50% of the total genome length A practical way to compute N50: - Sort contigs by decreasing lengths - Take the first contig (the largest): does it cover 50% of the assembly? - If yes, its length is the N50 value. - Else, consider the two largest contigs, do they cover 50%? - If yes, then the N50 is the length of the second largest contig. - And so on.. 34

41 35 Simka [Benoit et al, 2016] Full HMP data: n=690 samples, (32 billion reads) 0.5 day computation k-mer composition Abundance table A B N ACGAG CGAGC n sequencing datasets GAGCT >read0 ACACGTCAACGGCTACGGCAGA CTAACGACTCAGTC >read1 ACGAGCATCAGGGGACGTATGT TATGCAGTTCCAGG >read0 ACACGTCAACGGCTACGGCAGA CTAACGACTCAGTC >read1 ACGAGCATCAGGGGACGTATGT TATGCAGTTCCAGG >read0 ACACGTCAACGGCTACGGCAGA CTAACGACTCAGTC >read1 ACGAGCATCAGGGGACGTATGT TATGCAGTTCCAGG Feature extraction - (k-mer counting) Jaccard, Bray- Curtis,.. similarity matrix

42 When can we assemble 36

43 Digital normalization 37 Method using k-mers to normalize in silico the coverage of organisms. Enables to: - reduce the size of the dataset - more efficient assembly - (potentially) facilitate assembly of high-coverage material Potential drawbacks: more fragmented assembly, collapses heterozygosity

44 Conclusion 38 - de novo metagenomics is biologically and computationally challenging - Assemblers: metaspades (high quality), MEGAHIT, Minia - Alignment-free and assembly-free analysis: Simka References: CAMI benchmark article, metaspades article Lab: run metaspades, (break), evaluate results