Introduction to metagenome assembly. Bas E. Dutilh Metagenomic Methods for Microbial Ecologists, NIOO September 18 th 2014

Size: px
Start display at page:

Download "Introduction to metagenome assembly. Bas E. Dutilh Metagenomic Methods for Microbial Ecologists, NIOO September 18 th 2014"

Transcription

1 Introduction to metagenome assembly Bas E. Dutilh Metagenomic Methods for Microbial Ecologists, NIOO September 18 th 2014

2 Sequencing specs* Method Read length Accuracy Million reads Time Cost per M % 1 1 day $10 Illumina % 3, days $0.10 IonTorrent % hours $1 PacBio 1,000-30,000 87% hours $1 Sanger 400-1, % n/a 2 hours $2,400 SOLiD % 1, weeks $0.13 * these numbers change all the time!

3 Lengths of reads and genomes NGS technologies provide reads of 50 to max. 30,000 bp, but most genomes are much longer Gago, Science 2009

4 Random genome, random coverage TAGTATTATTGCTGCTCATAAAGTAGCTCCAGCTCATCTTGATACTAATGCTTTTTGTAATCTTATTGGTTGGCTTAAACCTAAAAGAGTTGAAGTTAA Average depth: Genome size G Base depth B=40x Read length L=100 bp K-mer size K=25 bp C = B * (L - K + 1) / L Uncovered bases: u = G * e C

5 Sequence assembly Reads Consensus sequence of assembled reads Includes alignment of all reads Contigs Order and orientation of contigs Sizes of the gaps between contigs (filled with NNN) Scaffold

6 Coverage Depth Horizontal coverage

7 Assembly of shotgun sequences Human genome project 1-2 kb Sanger reads < 10x coverage Low error rate High-throughput (meta-)genomics Millions/billions of ~ bp reads Mix of genomes with different coverage Biases and sequencing errors 2000 NOW Quality drops towards the end of reads Homo-polymers may be miss-called in 454 or Ion Torrent

8 Assembly strategies Reference-guided assembly Align reads to a (database) of reference genome(s) Cannot discover: Larger genomic mutations Insertions, deletions, rearrangements Distantly related species Most viruses De novo assembly Requires sufficient coverage x depth Breaks on repeats and low-coverage regions Algorithms Greedy assembly (only to illustrate) Overlap-layout-consensus De Bruijn graph

9 Reference-guided assembly Illumina sequencing of community DNA Same-species genome available (2.8M nt) Sometimes, only a minority of the reads can be mapped/aligned

10 Distant reference Natural diversity of community Species share >94% average nucleotide identity Consensus = average of the species Konstantinidis and Tiedje, PNAS 2004 Reference Consensus Genome space

11 Iterative mapping and assembly The assembly is a better representation of the community Reference Genome space First assembly Consensus Can we further approach the consensus genome by remapping the reads against this first assembly? Dutilh et al. Bioinformatics 2009

12 Iteration improves assembly More mapped reads Fewer gaps Dutilh et al. Bioinformatics 2009

13 De novo assembly AACAAGT CAAGTTA Assembly: AACAAGTTA

14 De novo assembly approaches Greedy approach Overlap-layout-consensus De Bruijn graphs

15 Greedy assembly 1. Sequences (reads/contigs) (reads) 2. Pairwise all-vs-all similarities 3. Find best matching pair 4. Collapse/assemble Works well for few, long reads (Sanger) All-vs-all calculations are expensive One clear best match Does not work for high throughput NGS datasets Many reads -> expensive to calculate Low coverage requires graph approach

16 Repetitive sequences repeat repeat A B C D B D A D B C Reads A-D are from a region with two long repeats Greedy approach would first join A-D with the largest overlap, and place B-C in a separate contig Resolving this requires a global view of all the possibilities before joining two reads: a graph

17 Assembly as a graph problem Overlap-layout-consensus De Bruijn Graph A graph contains nodes and edges node edge

18 Overlap-layout-consensus TAGTATTATTGCTGCTCATAAAGTAGCTCCAGCTCATCTTGATACTAATGCTTTTTGTAATCTTATTGGTTGGCTTAAACCTAAAAGAGTTGAAGTTAA 1. Identify all overlaps between reads Use cutoffs: minimum overlap and percent identity K CTCATAAAGTAGCTCCAGCTCATCTTGATACTAATGCTTTTTGTAATCT TTTGTAATCTTATTGGTTGGCTTAAACCTAAAAGAGTTGAAGTTAA J L GTAGCTCCAGCTCATCTTGATACTAATGCTTTTTGTAATCTTATTGGTT CTTGATACTAATGCTTTTTGTAATCTTATTGGTTGGCTTAAAC M TAGTATTATTGCTGCTCATAAAGTAGCTCCAGCTCATCTTGATACTAAT 2. Make graph of overlap connections Nodes: reads Edges: overlaps 3. Find Hamiltonian path Path that contains every node once No efficient algorithm available J K L M N 4. Determine consensus at each position N

19 De Bruijn graph TAGTATTATTGCTGCTCATAAAGTAGCTCCAGCTCATCTTGATACTAATGCTTTTTGTAATCTTATTGGTTGGCTTAAACCTAAAAGAGTTGAAGTTAA 1. Find every word of length k (k-mer) in every read J K-mer should be long enough to be quite unique, but short enough to not break on polymorphisms/errors K CTCATAAAGTAGCTCCAGCTCATCTTGATACTAATGCTTTTTGTAATCT L GTAGCTCCAGCTCATCTTGATACTAATGCTTTTTGTAATCTTATTGGTT TTTGTAATCTTATTGGTTGGCTTAAACCTAAAAGAGTTGAAGTTAA CTTGATACTAATGCTTTTTGTAATCTTATTGGTTGGCTTAAAC TAGTATTATTGCTGCTCATAAAGTAGCTCCAGCTCATCTTGATACTAAT CTTGATACTAATGCTTTTTGTAATCTTAT TTGATACTAATGCTTTTTGTAATCTTATT TGATACTAATGCTTTTTGTAATCTTATTG GATACTAATGCTTTTTGTAATCTTATTGG ATACTAATGCTTTTTGTAATCTTATTGGT TACTAATGCTTTTTGTAATCTTATTGGTT ACTAATGCTTTTTGTAATCTTATTGGTTG CTAATGCTTTTTGTAATCTTATTGGTTGG TAATGCTTTTTGTAATCTTATTGGTTGGC AATGCTTTTTGTAATCTTATTGGTTGGCT ATGCTTTTTGTAATCTTATTGGTTGGCTT TGCTTTTTGTAATCTTATTGGTTGGCTTA GCTTTTTGTAATCTTATTGGTTGGCTTAA CTTTTTGTAATCTTATTGGTTGGCTTAAA TTTTTGTAATCTTATTGGTTGGCTTAAAC M N

20 De Bruijn graph TAGTATTATTGCTGCTCATAAAGTAGCTCCAGCTCATCTTGATACTAATGCTTTTTGTAATCTTATTGGTTGGCTTAAACCTAAAAGAGTTGAAGTTAA 2. Make graph of sequential k-mers in sequence J Nodes: k-mers Edges: sequential presence of k-mers in reads K CTCATAAAGTAGCTCCAGCTCATCTTGATACTAATGCTTTTTGTAATCT L GTAGCTCCAGCTCATCTTGATACTAATGCTTTTTGTAATCTTATTGGTT TAGTATTATTGCTGCTCATAAAGTAGCTCCAGCTCATCTTGATACTAAT TTTGTAATCTTATTGGTTGGCTTAAACCTAAAAGAGTTGAAGTTAA CTTGATACTAATGCTTTTTGTAATCTTATTGGTTGGCTTAAAC CTTGATACTAATGCTTTTTGTAATCTTAT TTGATACTAATGCTTTTTGTAATCTTATT TGATACTAATGCTTTTTGTAATCTTATTG GATACTAATGCTTTTTGTAATCTTATTGG ATACTAATGCTTTTTGTAATCTTATTGGT TACTAATGCTTTTTGTAATCTTATTGGTT ACTAATGCTTTTTGTAATCTTATTGGTTG CTAATGCTTTTTGTAATCTTATTGGTTGG TAATGCTTTTTGTAATCTTATTGGTTGGC AATGCTTTTTGTAATCTTATTGGTTGGCT ATGCTTTTTGTAATCTTATTGGTTGGCTT TGCTTTTTGTAATCTTATTGGTTGGCTTA GCTTTTTGTAATCTTATTGGTTGGCTTAA CTTTTTGTAATCTTATTGGTTGGCTTAAA TTTTTGTAATCTTATTGGTTGGCTTAAAC M N

21 De Bruijn graph TAGTATTATTGCTGCTCATAAAGTAGCTCCAGCTCATCTTGATACTAATGCTTTTTGTAATCTTATTGGTTGGCTTAAACCTAAAAGAGTTGAAGTTAA 3. Find Eulerian path Path that contains every edge once Efficient algorithm available In an optimal sequencing run of a repeat-less genome, there is one path connecting all nodes In practice (especially in metagenomes) there are many possible structures in the graph Edge width represents the number of linking reads (depth)

22 Possible structures in De Bruijn graphs Cycle: path converges on itself Repeated region on the same contig Frayed rope: converge then diverge Repeated region on different contigs Bubble: paths diverge then converge Sequencing error in the middle of a read Polymorphisms Spur: short dead-ends Sequencing error at the end of a read Zero coverage shortly after end of repeat

23 Repeated regions In overlap-layout-consensus and De Bruijn graphs reads K-mers Li BFG 2012

24 Genome versus metagenome Depending on coverage Depending on diversity Expect single sequence Expect many sequences Contiguous sequence Fragmented sequences Even read depth Varying read depth Clonal sequence Identify sequencing errors by low coverage Repeats consist of duplicated genes and conserved domains Natural microdiversity Sequencing errors or natural diversity? Repeats also include closely related strains, conserved genes, etc.

25 Chimerization in metagenome assembly Both OLC and DBG include chimera protection Break contigs at ambiguities Works if depth/coverage is high enough contig1 contig2 contig3 contig4 contig5 Assess final result with different parameters High versus low stringency assembly Chimerization is more frequent between closely related strains

26 Assembly strategies Reference-guided assembly Align reads to a (database) of reference genome(s) Cannot discover: Larger genomic mutations Insertions, deletions, rearrangements Distantly related species Most viruses De novo assembly Requires sufficient read lengths, depth, and coverage Breaks on long repeats and low-coverage regions Algorithms Greedy assembly (only to illustrate) Overlap-layout-consensus De Bruijn graph

27 Scaffolding Use alignments to a related genome sequences to sort and orient de novo contigs Silva et al. Source Code Biol. Med. 2013