Genome Assembly. J Fass UCD Genome Center Bioinformatics Core Friday September, 2015

From reads to molecules

What's the Problem? How to get the best assemblies for the smallest expense (sequencing) and least effort (bioinformatics).

What's the Problem? "[...] repeats are the single biggest impediment to all assembly algorithms and sequencing technologies." ~ Koren 2012 Nat Biotech

What's the Problem? Repeats longer than the read (or template) length are impossible to resolve unambiguously.

[Figure: reads that span only A R B cannot distinguish the layout R A R B R from R B R A R.]

What's the Problem? Magical FutureSeq™ reads easily resolve these long repetitive regions, but have unfortunately been slow coming to market. [Figure: a single long read spans R A R B R.]

What's the Problem? Assembly graphs with perfect reads of length k. Koren & Phillippy 2015 Current Opinion in Microbiology 23:110

Software ~timeline
Celera Assembler (OLC assembler used for whole-genome shotgun human assembly, as opposed to NIH BAC-by-BAC approach)... now open source wgs-assembler
Velvet (one of the first de Bruijn graph assemblers)
ALLPATHS-LG (de Bruijn, recipe-based)
SGA - String Graph Assembler

OLC Assemblers: Overlap, Layout, Consensus.
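The overlap and layout steps can be illustrated with a toy example. This is a minimal sketch, assuming error-free reads and exact suffix-prefix matches; the function names `overlap` and `greedy_assemble` are hypothetical, and the consensus step is omitted because error-free reads need no correction. Real OLC assemblers use far more efficient overlap detection and tolerate sequencing errors.

```python
from itertools import permutations


def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b.

    Returns 0 if no exact overlap of at least min_len exists.
    """
    start = 0
    while True:
        # Look for the first min_len bases of b somewhere in a.
        start = a.find(b[:min_len], start)
        if start == -1:
            return 0
        # Check whether the rest of a, from that point, is a prefix of b.
        if b.startswith(a[start:]):
            return len(a) - start
        start += 1


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of reads with the longest exact overlap."""
    reads = list(reads)
    while True:
        best_len, best_a, best_b = 0, None, None
        for a, b in permutations(reads, 2):
            olen = overlap(a, b, min_len)
            if olen > best_len:
                best_len, best_a, best_b = olen, a, b
        if best_len == 0:
            return reads  # nothing left to merge
        reads.remove(best_a)
        reads.remove(best_b)
        reads.append(best_a + best_b[best_len:])


contigs = greedy_assemble(["ATGGCG", "GCGTGC", "TGCAAT"])
print(contigs)
```

With a repeat longer than the reads, more than one merge order gives an equally good overlap, which is the ambiguity described in the earlier slides.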

de Bruijn graph assembly: to reduce the computational challenge posed by millions of reads, break them up into smaller chunks (k-mers)?!

Constructing an assembly "graph"

de Bruijn graph assembler, Velvet: build the graph from 7 bp reads, with errors, using 4 bp k-mers. Tracking k-mers, not reads, essentially compresses the data... important for the NextGen era!

de Bruijn graph assembler, Velvet: Tip Removal, Bubble Popping, (Coverage Constraints). Cutting at every ambiguity (branch point) yields the final contigs: TAGTCGAG, GAGGCTTAGA, AGATCGGATGAG, AGAGACAG. Zerbino 2008 Genome Research 18:821-829
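To make the idea of tracking k-mers rather than reads concrete, here is a minimal sketch of de Bruijn graph construction. It uses the textbook formulation (nodes are (k-1)-mers, each k-mer is an edge), which is an assumption for illustration and not Velvet's internal data structure; the function names and the two example reads are hypothetical.

```python
from collections import defaultdict


def kmers(read, k):
    """All k-mers in a read, in order; a read of length L yields L - k + 1."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def de_bruijn(reads, k):
    """Map each edge (prefix (k-1)-mer, suffix (k-1)-mer) to its k-mer count."""
    edges = defaultdict(int)
    for read in reads:
        for kmer in kmers(read, k):
            edges[(kmer[:-1], kmer[1:])] += 1
    return dict(edges)


# Two overlapping 7 bp reads, broken into 4 bp k-mers.
graph = de_bruijn(["TAGTCGA", "AGTCGAG"], k=4)
for (src, dst), count in sorted(graph.items()):
    print(src, "->", dst, "seen", count, "time(s)")
```

The two reads contribute eight k-mer observations but only five distinct edges: shared k-mers collapse into a single edge with a higher count, which is the compression the slide refers to.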

K-mer coverage...? Performance (speed, memory, effectiveness of assembly) of de Bruijn-graph assemblers is correlated with k-mer coverage, not base coverage.

Base coverage

K-mer coverage: k-mers tile across reads; there are (L - k + 1) k-mers in a read of length L.

Error Exclusion: Smaller k will increase coverage of true k-mers (peak shifts to the right), but not error k-mers. Choosing a coverage cutoff that separates the two distributions will simplify the graph, removing noise and leaving signal. Simple graph = longer contigs!
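The relationship between base coverage and k-mer coverage follows directly from the (L - k + 1) count. A small sketch, assuming all reads have the same length L and are error-free; the function names are hypothetical:

```python
def kmers_per_read(read_len, k):
    """Number of k-mers that tile across a read of length read_len."""
    return max(read_len - k + 1, 0)


def kmer_coverage(base_coverage, read_len, k):
    """Expected k-mer coverage for a given base coverage.

    Each read contributes read_len bases but only (read_len - k + 1)
    k-mers, so coverage is scaled by that ratio.
    """
    return base_coverage * kmers_per_read(read_len, k) / read_len


# 50x base coverage with 100 bp reads and k = 31
print(kmer_coverage(50, 100, 31))
```

Larger k therefore means lower k-mer coverage for the same sequencing depth, which is one reason k cannot be raised indefinitely.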
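A coverage cutoff can be sketched as follows. This is an illustrative toy, assuming exact in-memory counting with a `Counter` (real assemblers use more compact structures); the function names, the example reads, and the cutoff value are all hypothetical.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer occurrence across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def abundance_histogram(counts):
    """Map coverage value -> number of distinct k-mers with that coverage."""
    return Counter(counts.values())


def solid_kmers(counts, cutoff):
    """Keep only k-mers seen at least `cutoff` times."""
    return {kmer for kmer, n in counts.items() if n >= cutoff}


# Five copies of a true read plus one read carrying a single-base error.
reads = ["ACGTAC"] * 5 + ["ACGTTC"]
counts = count_kmers(reads, k=4)
print(sorted(abundance_histogram(counts).items()))
print(sorted(solid_kmers(counts, cutoff=3)))
```

In the histogram, the error k-mers pile up at coverage 1 while the true k-mers sit at coverage 5 or 6; a cutoff between the two groups removes the error k-mers before the graph is built.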

Choosing k Smaller k-mers increase the connectivity of the graph by simultaneously increasing the chance of observing an overlap between two reads and the number of ambiguous repeats in the graph. There is therefore a balance between sensitivity and specificity determined by k. ~Zerbino (2008) Genome Research 18:821

Assembly Miscellanea

Hierarchical Assembly Amplify Bacterial Artificial Chromosomes, Fosmids, etc.... sequence, assemble (simpler problem for BACs than chromosomes), then assemble the assemblies.

Scaffolding

Gap filling / contig extension: IMAGE (Iterative Mapping and Assembly for Gap Elimination), Tsai 2010 Genome Biology 11:R41; PRICE (Paired Read Iterative Contig Extension), DeRisi lab, UCSF

Reference-assisted assembly

Error Correction (Quake)

Error Correction: Similar correction methods are incorporated into modern assemblers (like SOAPdenovo, SGA, ALLPATHS), and error exclusion (based on k-mer coverage) is an element of some (Velvet...).

Digital Normalization: K-mer based one-pass filtering/trimming of short reads; discards redundant data to even out uneven coverage, and preferentially discards or trims error-containing reads. This reduces graph size (RAM) and computation time for assemblers. Brown 2012 arXiv:1203.4802v2

Digital Normalization: Based on median k-mer abundance / coverage, diginorm discards the majority of error-containing k-mers, while retaining nearly all real k-mers (discards data, not information). Brown 2012 arXiv:1203.4802v2
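The single-pass filter based on median k-mer abundance can be sketched in a few lines. This is a toy version under stated assumptions: it keeps exact counts in a `Counter` rather than the memory-efficient probabilistic counting used by real implementations, and the function name, `k`, and `cutoff` values are hypothetical.

```python
from collections import Counter
from statistics import median


def normalize_by_median(reads, k=4, cutoff=2):
    """Keep a read only if the median abundance of its k-mers, among
    reads kept so far, is still below `cutoff`."""
    counts = Counter()
    kept = []
    for read in reads:
        read_kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        if not read_kmers:
            continue  # read shorter than k
        if median(counts[kmer] for kmer in read_kmers) < cutoff:
            kept.append(read)
            counts.update(read_kmers)  # only kept reads add to the counts
    return kept


reads = ["ACGTAC"] * 5 + ["TTTTGGGG"]
print(normalize_by_median(reads))
```

Redundant copies of a well-covered region are dropped once the region reaches the cutoff, while a read from a region not yet seen is always kept. Using the median rather than the mean means one or two error k-mers in a read do not change the decision.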

Diginorm (second pass - trimming): After digital normalization, make a second pass in which the 3' ends of reads are trimmed to remove low-frequency k-mers.

Diginorm (third pass - normalization): After trimming, do another normalization pass. Trimming in between two normalization passes allows more discrimination between erroneous and real k-mers. The majority of computational time is spent in the first pass (normalization), so the three-pass approach is not much more demanding than the single-pass approach.
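A trimming pass of this kind might look like the sketch below. It assumes k-mer counts are already available as a plain dictionary and truncates the read just before the first base covered only by low-frequency k-mers; the function name, the example counts, and the `min_count` threshold are hypothetical.

```python
def trim_low_abundance(read, counts, k, min_count=2):
    """Truncate a read at the first k-mer seen fewer than min_count times.

    The first k - 1 bases of that k-mer are also covered by the previous
    (trusted) k-mer, so they are retained.
    """
    for i in range(len(read) - k + 1):
        if counts.get(read[i:i + k], 0) < min_count:
            return read[:i + k - 1]
    return read


# Hypothetical counts: the k-mers CGTT and GTTC come from a single erroneous read.
counts = {"ACGT": 6, "CGTA": 5, "GTAC": 5, "CGTT": 1, "GTTC": 1}
print(trim_low_abundance("ACGTTC", counts, k=4))
print(trim_low_abundance("ACGTAC", counts, k=4))
```

Reads that end up shorter than k after trimming carry no k-mers and would normally be dropped.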

Assemblers of note...

SPAdes (St. Petersburg Assembler): uneven coverage, chimerism
Nurk, Bankevich et al. (2013 book chapter) DOI:10.1007/978-3-642-37195-0_13
Bankevich, Nurk et al. (2012) J Comp Biol DOI:10.1089/cmb.2012.0021
Deals with highly uneven coverage depth (like IDBA-UD) but also high rates of chimerism in sequencing libraries (more of a problem for single-cell assemblies amplified with Multiple Displacement Amplification - micrometagenomes?).
Users of SPAdes report: "It just works."

ALLPATHS-LG... and its "recipe"
Ribeiro (2012) Genome Research doi:10.1101/gr.141515.112
Gnerre (2011) PNAS 108:1513
Makes use of a recipe of three (or four) different libraries (see below); can be run without the largest-scale libraries, but not for best results. Makes sense for an institute that can standardize its sequencing and bioinformatics together.
Gnerre 2011:
45x  Overlapping PE reads (180 bp ISIZE, >100 bp reads)
45x  Short jump / MP (3 kb)
5x   Optional long jump / MP (6 kb)
1x   Optional fosmid jump / MP (40 kb)
Ribeiro 2012:
50x  Overlapping PE reads (180 bp ISIZE, >100 bp reads)
50x  1-3 kb PacBio reads
50x  Long jump / MP (2-10 kb)

sga: String Graph Assembler
Simpson, J and Durbin, R (2010) Efficient construction of an assembly string graph using the FM-index. Bioinformatics 26:i367
String graphs retain the information lost by de Bruijn graphs (full read context) by building graphs based on the full overlaps between reads (instead of k-mers). But this requires all-to-all overlap detection! sga uses the BWT and FM-index to make this tractable, but graph construction is still the most computationally expensive step. Compared to de Bruijn graph assemblers, sga uses less memory but is significantly slower.

DISCOVAR de novo
Weisenfeld, et al. (2014) Comprehensive variation discovery in single human genomes. Nature Genetics 46:1350 (Publication is for DISCOVAR, assembly and variant finding for smaller organisms, not DISCOVAR de novo, the assembler for large genomes.)
Uses a single PCR-free, SPRI-bead size-selected library, and at least 60x coverage with PE250 reads. The size selection yields a broad spectrum of fragment sizes, and the longer-distance read pairs are used for scaffolding. Polymorphic sequences, or consensus sequences, can be pulled from the resulting graph structure.

Bringing PacBio into the picture

PacBio Read Correction
"PBcR" (web page at UMd): http://www.cbcb.umd.edu/software/pbcr/ (links to spec files, raw data)
PBcR (wgs-assembler script) pages in the wgs-assembler (Celera Assembler) wiki: http://wgs-assembler.sourceforge.net/wiki/index.php/pbcr
ec-tools code on GitHub: https://github.com/jgurtowski/ectools plus data: http://schatzlab.cshl.edu/data/ectools/
Also in SMRT-analysis software; code on GitHub: https://github.com/pacificbiosciences/smrt-analysis

PacBio Read Correction
Short, high-accuracy reads (Illumina, 454, PB-CCS) are mapped to PB reads.
Small coverage gaps recruit other PB reads to fill them.
Large coverage gaps split reads (the maxgap option controls the cutoff size).
Recommended minimum: 20-30x PacBio, 50x high-accuracy reads.

PacBio Read Correction: maxgap? Gaps shorter than the 'maxgap' setting get a chance to recruit multiple PB reads for support / correction. Gaps longer than the 'maxgap' setting are automatically split. (Koren, personal communication)
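The effect of a gap-size cutoff can be illustrated with a small interval-merging sketch. This is not PBcR's code; it is a hypothetical illustration that assumes the short-read coverage of one PacBio read is given as half-open (start, end) intervals. Gaps no longer than the cutoff are bridged (in the real pipeline these are the candidates for filling with other PB reads), and gaps longer than the cutoff split the read.

```python
def split_at_large_gaps(covered, maxgap):
    """Merge covered intervals across small gaps; split at gaps > maxgap.

    covered: list of (start, end) half-open intervals on one long read.
    Returns the list of (start, end) segments the read would be split into.
    """
    segments = []
    seg_start = seg_end = None
    for start, end in sorted(covered):
        if seg_start is None:
            seg_start, seg_end = start, end
        elif start - seg_end > maxgap:
            # Gap too large to bridge: close this segment, start a new one.
            segments.append((seg_start, seg_end))
            seg_start, seg_end = start, end
        else:
            # Small gap or overlapping interval: extend the current segment.
            seg_end = max(seg_end, end)
    if seg_start is not None:
        segments.append((seg_start, seg_end))
    return segments


# A 10 bp gap is bridged; a 600 bp gap splits the read in two.
print(split_at_large_gaps([(0, 100), (110, 300), (900, 1200)], maxgap=50))
```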

PacBio Read Correction: More recent Koren paper available at arxiv.org... check: http://www.cbcb.umd.edu/software/pbcr/ Discusses PB read self-correction (for long reads from C2 or better chemistry). No independent high-accuracy reads needed; PB reads are aligned to each other to infer a consensus sequence. Implemented in Celera Assembler (wgs-assembler pacbiotoca script) and in PacBio's HGAP pipeline. Also, MHAP for faster alignment of long, noisy reads (reduces the bottleneck in assembly).

historic Genome Assemblers
Celera Assembler (used for whole-genome shotgun human assembly, as opposed to NIH BAC-by-BAC approach)... now, wgs-assembler (PBcR!)
Velvet (one of the first de Bruijn graph assemblers)
ALLPATHS-LG (de Bruijn, recipe-based)
SGA - String Graph Assembler
With high-accuracy long reads, older OLC assemblers become more appropriate.

How to incorporate PacBio?
small-ish (< 10 Mbp) genomes: 100x PacBio; PBcR or HGAP
medium (10-100 Mbp): 60-100x PacBio; HGAP
moderate (< 1 Gbp): > 20x PacBio, 50x Illumina; PBcR or EC Tools, DBG2OLC?
large (> 1 Gbp): > 5x PacBio, 50-200x Illumina; Illumina assembly, then PBSuite

PacBio's HGAP: (Not shown) The Quiver algorithm polishes the assembly by aligning all reads to the finished genome and calling a new consensus. Chin (2013) Nature Methods 10:563 doi:10.1038/nmeth.2474

Assembly Assessment

Assembly Competitions
Assemblathon: http://assemblathon.org/
1: Earl 2011 Genome Research 21:2224
2: arXiv.org - http://arxiv.org/abs/1301.5406
GAGE - Genome Assembly Gold-standard Evaluations: http://gage.cbcb.umd.edu/
dnGASP - de novo Genome Assembly Project: http://cnag.bsc.es/

Assembly Assessment: N50; NG50; Cumulative Length Plots; Feature Response Curves; (Alignment) Block NG50 (versus a good? reference); Read alignment methods

N50

N50, confused

NG50
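Both statistics are quick to compute from a list of contig lengths. A minimal sketch using the usual definitions: N50 is the length of the contig at which the running total, taken from longest to shortest, first reaches half of the assembly size; NG50 uses half of the (estimated) genome size instead. The function names and example lengths are hypothetical.

```python
def nx(lengths, total, fraction=0.5):
    """Length of the contig at which the running sum (longest first)
    first reaches fraction * total; 0 if it never does."""
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= total * fraction:
            return length
    return 0


def n50(lengths):
    """N50: threshold is half of the total assembly length."""
    return nx(lengths, sum(lengths))


def ng50(lengths, genome_size):
    """NG50: threshold is half of the estimated genome size."""
    return nx(lengths, genome_size)


contigs = [80, 70, 50, 40, 30, 20]  # total = 290
print(n50(contigs))
print(ng50(contigs, genome_size=400))
```

Because N50 depends on the assembly's own total length, discarding short contigs raises it without improving the assembly; NG50 avoids this by fixing the denominator, which makes values comparable across assemblies of the same genome.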

Cumulative Length Plots
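The data behind a cumulative length plot is just a running sum over contigs sorted from longest to shortest. A small sketch (the function name is hypothetical, and plotting is left to whatever library is at hand):

```python
from itertools import accumulate


def cumulative_lengths(lengths):
    """Running total of contig lengths, longest contig first."""
    return list(accumulate(sorted(lengths, reverse=True)))


# y-values for the plot; the x-value is the contig rank (1, 2, 3, ...)
print(cumulative_lengths([30, 80, 50]))
```

Plotting these curves for several assemblies on the same axes shows the whole length distribution at once, rather than the single point that N50 or NG50 summarizes.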

Align to Trusted Reference: Mauve's contig reorder tool: http://gel.ahabs.wisc.edu/mauve/

(Alignment) Block NG50

Cumulative (Alignment) Length Plots

Read-based Assessment (AMOS-validate), FRCurves, REAPR Vezzi 2012 PLoS One DOI: 10.1371/journal.pone.0031002

Questions?