Genome Assembly J Fass UCD Genome Center Bioinformatics Core Friday September, 2015
From reads to molecules
What's the Problem? How to get the best assemblies for the smallest expense (sequencing) and least effort (bioinformatics).
What's the Problem? "[...] repeats are the single biggest impediment to all assembly algorithms and sequencing technologies." ~ Koren 2012 Nat Biotech
What's the Problem? Repeats larger than the read (or template) length are impossible to resolve unambiguously. [Figure: reads reach only from A through repeat R into B, so the arrangements R-A-R-B-R and R-B-R-A-R cannot be told apart.]
What's the Problem? Magical FutureSeq™ reads easily resolve these long repetitive regions, but have unfortunately been slow coming to market. [Figure: a single long read spanning R-A-R-B-R resolves the arrangement.]
What's the Problem? Assembly graphs with perfect reads of length k. Koren & Phillippy 2015 Current Opinion in Microbiology 23:110
Software ~timeline: Celera Assembler (OLC assembler used for whole-genome shotgun human assembly, as opposed to the NIH BAC-by-BAC approach)... now open source as wgs-assembler; Velvet (one of the first de Bruijn graph assemblers); ALLPATHS-LG (de Bruijn, recipe-based); SGA (String Graph Assembler)
OLC Assemblers
OLC Assemblers Overlap, Layout, Consensus. [Figure: the three steps illustrated on the A, R, B repeat example.]
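To make the Overlap and Layout steps concrete, here is a minimal toy sketch. It assumes error-free reads and exact suffix/prefix matches, and it merges greedily; the function names and the `min_overlap` parameter are illustrative only, and real OLC assemblers use approximate overlaps, a proper layout graph, and a separate consensus step.

```python
def overlap_length(a: str, b: str, min_overlap: int = 3) -> int:
    """Length of the longest suffix of `a` that exactly equals a prefix of `b`.

    Returns 0 if no overlap of at least `min_overlap` bases exists.
    """
    max_len = min(len(a), len(b))
    for length in range(max_len, min_overlap - 1, -1):
        if a[-length:] == b[:length]:
            return length
    return 0


def greedy_layout(reads, min_overlap=3):
    """Repeatedly merge the pair of sequences with the longest exact overlap.

    This is a toy stand-in for Overlap + Layout; with perfect reads the
    Consensus step is trivial, so it is omitted here.
    """
    contigs = list(reads)
    while True:
        best = (0, None, None)  # (overlap length, index of left, index of right)
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    olen = overlap_length(a, b, min_overlap)
                    if olen > best[0]:
                        best = (olen, i, j)
        olen, i, j = best
        if olen == 0:
            return contigs  # nothing left to merge
        merged = contigs[i] + contigs[j][olen:]
        contigs = [c for idx, c in enumerate(contigs) if idx not in (i, j)]
        contigs.append(merged)


if __name__ == "__main__":
    print(greedy_layout(["ATGGCGT", "GCGTACC", "TACCTGA"]))
    # ['ATGGCGTACCTGA']
```

The all-against-all loop is the expensive part, which is one reason overlap detection dominates the cost of OLC-style assembly.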
de Bruijn graph assembly To reduce the computational challenge posed by millions of reads, break them up into smaller chunks (k-mers)?!
Constructing an assembly "graph"
de Bruijn graph assembler, Velvet Build graph from 7 bp reads, with errors... using 4 bp k-mers. Tracking k-mers, not reads, essentially compresses the data... important for the NextGen era!
de Bruijn graph assembler, Velvet Tip Removal; Bubble Popping; (Coverage Constraints). Cutting at every ambiguity (branch point) yields the final contigs: TAGTCGAG, GAGGCTTAGA, AGATCGGATGAG, AGAGACAG. Zerbino 2008 Genome Research 18:821-829
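A small sketch of the idea, assuming the simplest formulation in which each distinct k-mer is a node and an edge joins k-mers that are adjacent in some read. The function name `build_kmer_graph` and the two example reads are made up for illustration; this is not Velvet's actual data structure, which also handles reverse complements and merges linear chains of nodes.

```python
from collections import Counter


def build_kmer_graph(reads, k):
    """Return (node_counts, edge_counts) for a toy k-mer graph.

    node_counts: how many times each k-mer was observed across all reads.
    edge_counts: how many times one k-mer was directly followed by another.
    """
    node_counts = Counter()
    edge_counts = Counter()
    for read in reads:
        kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        node_counts.update(kmers)
        edge_counts.update(zip(kmers, kmers[1:]))
    return node_counts, edge_counts


if __name__ == "__main__":
    # Two overlapping 7 bp reads, 4 bp k-mers (as on the slide).
    nodes, edges = build_kmer_graph(["ATGGCGT", "TGGCGTA"], k=4)
    print(len(nodes))                 # 5 distinct k-mers from 8 observed
    print(edges[("TGGC", "GGCG")])    # 2: this adjacency is seen in both reads
```

Shared k-mers collapse into a single node with a count, which is the sense in which tracking k-mers rather than reads compresses redundant data.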
K-mer coverage...? Performance (speed, memory, effectiveness of assembly) of de Bruijn-graph assemblers is correlated with k-mer coverage, not base coverage.
Base coverage
K-mer coverage: k-mers tile across reads; there are (L - k + 1) k-mers in a read of length L.
Error Exclusion Smaller k will increase coverage of true k-mers (peak shifts to the right), but not error k-mers. Choosing a coverage cutoff that separates the two distributions will simplify the graph, removing noise and leaving signal. Simple graph = longer contigs!
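The count above leads to the usual conversion between base coverage and k-mer coverage, Ck = C * (L - k + 1) / L. A minimal sketch follows; the function names and the example numbers (100 bp reads, k = 31, 50x base coverage) are chosen only for illustration.

```python
def kmers_per_read(read_length: int, k: int) -> int:
    """Number of k-mers that tile across one read of the given length."""
    if k > read_length:
        return 0
    return read_length - k + 1


def kmer_coverage(base_coverage: float, read_length: int, k: int) -> float:
    """Expected k-mer coverage for a given base coverage, read length and k."""
    return base_coverage * kmers_per_read(read_length, k) / read_length


if __name__ == "__main__":
    print(kmers_per_read(100, 31))     # 70
    print(kmer_coverage(50, 100, 31))  # 35.0
```

Note how a larger k lowers k-mer coverage for the same data, which is why k-mer coverage rather than base coverage is the quantity to watch.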
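As an illustration of a coverage cutoff, the sketch below counts k-mers exactly, builds the abundance histogram, and keeps only k-mers at or above a chosen cutoff. The function names, the toy reads, and the cutoff value are assumptions made for this example; real assemblers choose the cutoff from the observed coverage distribution and use far more memory-efficient counting.

```python
from collections import Counter


def count_kmers(reads, k):
    """Exact k-mer counts across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def abundance_histogram(counts):
    """Map abundance -> number of distinct k-mers seen that many times."""
    return Counter(counts.values())


def solid_kmers(counts, cutoff):
    """K-mers whose count is at least `cutoff` (likely real, not errors)."""
    return {kmer for kmer, n in counts.items() if n >= cutoff}


if __name__ == "__main__":
    # Five copies of a true read plus one read carrying a single error.
    reads = ["ATGGCGT"] * 5 + ["ATGGAGT"]
    counts = count_kmers(reads, k=4)
    print(sorted(abundance_histogram(counts).items()))  # [(1, 3), (5, 3), (6, 1)]
    print(sorted(solid_kmers(counts, cutoff=3)))
    # ['ATGG', 'GCGT', 'GGCG', 'TGGC']
```

In the toy data the single error creates three k-mers seen once each, well separated from the true k-mers, so any cutoff between 2 and 5 removes the noise.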
Choosing k Smaller k-mers increase the connectivity of the graph by simultaneously increasing the chance of observing an overlap between two reads and the number of ambiguous repeats in the graph. There is therefore a balance between sensitivity and specificity determined by k. ~Zerbino (2008) Genome Research 18:821
Assembly Miscellanea
Hierarchical Assembly Amplify Bacterial Artificial Chromosomes, Fosmids, etc., then sequence and assemble each one (a simpler problem for BACs than for chromosomes), then assemble the assemblies.
Scaffolding
Gap filling / contig extension
Gap filling / contig extension IMAGE (Iterative Mapping and Assembly for Gap Elimination), Tsai 2010 Genome Biology 11:R41; PRICE (Paired Read Iterative Contig Extension), DeRisi lab, UCSF
Reference-assisted assembly
Error Correction (Quake)
Error Correction Similar correction methods are incorporated into modern assemblers (like SOAPdenovo, SGA, ALLPATHS), and error exclusion (based on k-mer coverage) is an element of some (Velvet...)
Digital Normalization K-mer based one-pass filtering/trimming of short reads; discards redundant data to even out uneven coverage, and preferentially discards or trims error-containing reads. This reduces graph size (RAM) and computation time for assemblers. Brown 2012 arXiv:1203.4802v2
Digital Normalization Based on median k-mer abundance / coverage, diginorm discards the majority of error-containing k-mers, while retaining nearly all real k-mers (discards data, not information). Brown 2012 arXiv:1203.4802v2
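The normalization pass can be sketched as follows: for each read, estimate its coverage as the median count of its k-mers seen so far, and keep the read only if that estimate is below a target. This is a simplified illustration with exact counts in a dictionary; the function name, `cutoff` parameter, and toy reads are assumptions, and the published implementation uses a memory-efficient probabilistic counting structure instead.

```python
from collections import Counter
from statistics import median


def digital_normalization(reads, k, cutoff):
    """Single streaming pass: keep a read only while its median k-mer
    abundance (among reads kept so far) is below `cutoff`."""
    counts = Counter()
    kept = []
    for read in reads:
        kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        if not kmers:
            continue  # read shorter than k
        if median(counts[kmer] for kmer in kmers) < cutoff:
            kept.append(read)
            counts.update(kmers)  # only kept reads contribute to the counts
    return kept


if __name__ == "__main__":
    # A heavily over-sampled region plus one read from a rare region.
    reads = ["ATGGCGT"] * 10 + ["CCTTAAG"]
    kept = digital_normalization(reads, k=4, cutoff=3)
    print(len(kept))           # 4
    print(kept.count("ATGGCGT"))  # 3 copies survive; the rare read is kept too
```

Redundant reads from the over-sampled region are dropped once the target is reached, while the low-coverage read is retained, which is the "discards data, not information" behaviour.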
Diginorm (second pass - trimming) After digital normalization, make a second pass in which the 3' ends of reads are trimmed to remove low-frequency k-mers.
Diginorm (third pass - normalization) After trimming, do another normalization pass. Trimming between the two normalization passes allows better discrimination between erroneous and real k-mers. Most of the computational time is spent in the first pass (normalization), so the three-pass approach is not much more demanding than the single-pass approach.
Assemblers of note...
SPAdes (St. Petersburg Assembler): uneven coverage, chimerism. Nurk, Bankevich et al. (2013 book chapter) DOI:10.1007/978-3-642-37195-0_13; Bankevich, Nurk et al. (2012) J Comp Biol DOI:10.1089/cmb.2012.0021. Deals with highly uneven coverage depth (like IDBA_UD) but also with high rates of chimerism in sequencing libraries (more of a problem for single-cell assemblies amplified with Multiple Displacement Amplification; micrometagenomes?). Users of SPAdes report: "It just works."
Allpaths-LG... and its "recipe" Ribeiro (2012) Genome Research doi:10.1101/gr.141515.112; Gnerre (2011) PNAS 108:1513. Makes use of a recipe of three (or four) different libraries (see below); it can be run without the largest-scale libraries, but not for best results. Makes sense for an institute that can standardize its sequencing and bioinformatics together.
Gnerre 2011:
  45x overlapping PE reads (180 bp ISIZE, >100 bp reads)
  45x short jump / MP (3 kb)
  5x optional long jump / MP (6 kb)
  1x optional fosmid jump / MP (40 kb)
Ribeiro 2012:
  50x overlapping PE reads (180 bp ISIZE, >100 bp reads)
  50x 1-3 kb PacBio reads
  50x long jump / MP (2-10 kb)
sga: String Graph Assembler Simpson, J and Durbin, R (2010) Efficient construction of an assembly string graph using the FM-index. Bioinformatics 26:i367. String graphs retain the information lost by de Bruijn graphs (full read context) by building graphs based on the full overlaps between reads (instead of k-mers). But this requires all-to-all overlap detection! sga uses the BWT & FM-index to make this tractable, but graph construction is still the most (computationally) expensive step. Compared to de Bruijn graph assemblers, sga uses less memory, but is significantly slower.
DISCOVAR de novo Weisenfeld et al. (2014) Comprehensive variation discovery in single human genomes. Nature Genetics 46:1350 (Publication is for DISCOVAR -- assembly and variant finding for smaller organisms -- not DISCOVAR de novo -- assembler for large genomes). Uses a single PCR-free, SPRI-bead size-selected library, and at least 60x coverage with PE250 reads. The size selection yields a broad spectrum of fragment sizes, and the longer-distance read pairs are used for scaffolding. Either polymorphic sequences or consensus sequences can be pulled from the resulting graph structure.
Bringing PacBio into the picture
PacBio Read Correction
  "PBcR" (web page at UMd): http://www.cbcb.umd.edu/software/pbcr/ (links to spec files, raw data)
  PBcR (wgs-assembler script) pages in the wgs-assembler (Celera Assembler) wiki: http://wgs-assembler.sourceforge.net/wiki/index.php/pbcr
  ec-tools code on GitHub: https://github.com/jgurtowski/ectools plus data: http://schatzlab.cshl.edu/data/ectools/
  also in SMRT-analysis software; code on GitHub: https://github.com/pacificbiosciences/smrt-analysis
PacBio Read Correction Short, high-accuracy reads (Illumina, 454, PB-CCS) are mapped to PB reads. Small coverage gaps recruit other PB reads to fill them; large coverage gaps split reads (the maxgap option controls the cutoff size). Recommended minimum: 20-30X PacBio and 50X high-accuracy reads.
PacBio Read Correction maxgap? Gaps shorter than the 'maxgap' setting get a chance to recruit multiple PB reads for support / correction. Gaps longer than the 'maxgap' setting are automatically split. (Koren, personal communication)
PacBio Read Correction More recent Koren paper available at arxiv.org... check: http://www.cbcb.umd.edu/software/pbcr/ Discusses PB read self-correction (for long reads from C2 or better chemistry). No independent high-accuracy reads needed; PB reads are aligned to each other to infer a consensus sequence. Implemented in Celera Assembler (wgs-assembler pacbiotoca script) and in PacBio's HGAP pipeline. Also, MHAP for faster alignment of long, noisy reads (reduces a bottleneck in assembly).
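The splitting rule described above can be sketched as a small interval routine. This is only an illustration of the decision logic, not the PBcR implementation: the function name, the interval representation (half-open coordinates along one long read), and the example numbers are all assumptions.

```python
def split_at_large_gaps(covered, maxgap):
    """Split one long read into segments at coverage gaps longer than `maxgap`.

    covered: sorted (start, end) half-open intervals along the long read
             that are supported by high-accuracy read alignments.
    Gaps of at most `maxgap` bases are bridged (in the real pipeline these
    would be candidates for filling from other long reads); longer gaps
    end the current segment.
    """
    if not covered:
        return []
    segments = []
    seg_start, seg_end = covered[0]
    for start, end in covered[1:]:
        if start - seg_end > maxgap:
            segments.append((seg_start, seg_end))
            seg_start = start
        seg_end = max(seg_end, end)
    segments.append((seg_start, seg_end))
    return segments


if __name__ == "__main__":
    covered = [(0, 400), (420, 900), (1500, 2000)]
    print(split_at_large_gaps(covered, maxgap=100))
    # [(0, 900), (1500, 2000)]
```

Here the 20-base gap is bridged while the 600-base gap splits the read into two corrected pieces.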
Historic Genome Assemblers: Celera Assembler (used for whole-genome shotgun human assembly, as opposed to the NIH BAC-by-BAC approach)... now wgs-assembler (PBcR!); Velvet (one of the first de Bruijn graph assemblers); ALLPATHS-LG (de Bruijn, recipe-based); SGA (String Graph Assembler). With high-accuracy long reads, older OLC assemblers become more appropriate.
How to incorporate PacBio?
  small-ish (< 10 Mbp) genomes: 100x PacBio; PBcR or HGAP
  medium (10-100 Mbp): 60-100x PacBio; HGAP
  moderate (< 1 Gbp): > 20x PacBio, 50x Illumina; PBcR or EC Tools, DBG2OLC?
  large (> 1 Gbp): > 5x PacBio, 50-200x Illumina; Illumina assembly, then PBSuite
PacBio's HGAP (Not shown): the Quiver algorithm polishes the assembly by aligning all reads to the finished genome and calling a new consensus. Chin (2013) Nature Methods 10:563 doi:10.1038/nmeth.2474
Assembly Assessment
Assembly Competitions
  Assemblathon, http://assemblathon.org/ (1: Earl 2011 Genome Research 21:2224; 2: arXiv, http://arxiv.org/abs/1301.5406)
  GAGE (Genome Assembly Gold-standard Evaluations), http://gage.cbcb.umd.edu/
  dngasp (de novo Genome Assembly Project), http://cnag.bsc.es/
Assembly Assessment: N50; NG50; Cumulative Length Plots; Feature Response Curves; (Alignment) Block NG50 (versus a good? reference); Read alignment methods
N50
N50, confused
NG50
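For reference, a minimal sketch of how N50 and NG50 are commonly computed: sort contigs from longest to shortest and report the length of the contig at which the running total first reaches half of the assembly size (N50) or half of the estimated genome size (NG50). The function names and the example contig lengths are made up for illustration.

```python
def n50(lengths, total=None):
    """Contig length at which the running sum of contigs, taken longest
    first, reaches at least half of `total` (default: the assembly size)."""
    if total is None:
        total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # the contigs never reach half of `total`


def ng50(lengths, genome_size):
    """Same statistic, but relative to an estimated genome size."""
    return n50(lengths, total=genome_size)


if __name__ == "__main__":
    contigs = [80, 70, 50, 40, 30, 20, 10]  # total = 300
    print(n50(contigs))         # 70
    print(ng50(contigs, 400))   # 50
    print(ng50(contigs, 1000))  # 0: assembly covers less than half the genome
```

Because N50 depends on the assembly's own total length, two assemblies of the same genome are only comparable through NG50, which uses a shared genome size.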
Cumulative Length Plots
Align to Trusted Reference Mauve's contig reorder tool: http://gel.ahabs.wisc.edu/mauve/
(Alignment) Block NG50
Cumulative (Alignment) Length Plots
Read-based Assessment (AMOS-validate), FRCurves, REAPR Vezzi 2012 PLoS One DOI: 10.1371/journal.pone.0031002
Questions?