Supplementary Figures and Tables

Similar documents
Next Generation Sequencing for Metagenomics

RNA sequencing with the MinION at Genoscope

Coral holobiont analysis with MinION sequencer onboard Tara

Next Generation Sequencing data Analysis at Genoscope. Jean-Marc Aury

Korilog. high-performance sequence similarity search tool & integration with KNIME platform. Patrick Durand, PhD, CEO. BIOINFORMATICS Solutions

SUPPLEMENTARY MATERIAL FOR THE PAPER: RASCAF: IMPROVING GENOME ASSEMBLY WITH RNA-SEQ DATA

De novo genome assembly with next generation sequencing data!! "

De Novo and Hybrid Assembly

Illumina (Solexa) Throughput: 4 Tbp in one run (5 days) Cheapest sequencing technology. Mismatch errors dominate. Cost: ~$1000 per human genme

Sequence Assembly and Alignment. Jim Noonan Department of Genetics

De novo assembly of human genomes with massively parallel short read sequencing. Mikk Eelmets Journal Club

MinION, GridION, how does Nanopore technology meet the needs of our users?

Supplementary Data for Hybrid error correction and de novo assembly of single-molecule sequencing reads

TruSPAdes: analysis of variations using TruSeq Synthetic Long Reads (TSLR)

De novo whole genome assembly

Nature Biotechnology: doi: /nbt Supplementary Figure 1. Read Complexity

Mate-pair library data improves genome assembly

The Diploid Genome Sequence of an Individual Human

Supplementary Figure 1 Genotyping by Sequencing (GBS) pipeline used in this study to genotype maize inbred lines. The 14,129 maize inbred lines were

Supplementary Figures

Chromosome-scale scaffolding of de novo genome assemblies based on chromatin interactions. Supplementary Material

SUPPLEMENTARY INFORMATION

Hybrid Error Correction and De Novo Assembly with Oxford Nanopore

Introduction to metagenome assembly. Bas E. Dutilh Metagenomic Methods for Microbial Ecologists, NIOO September 18 th 2014

Copyright (c) 2008 Daniel Huson. Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation

Supplementary Material

Fast-SG: an alignment-free algorithm for hybrid assembly

Supplementary Figure 1. Design of the control microarray. a, Genomic DNA from the

Human Genome Sequencing Over the Decades The capacity to sequence all 3.2 billion bases of the human genome (at 30X coverage) has increased

Current'Advances'in'Sequencing' Technology' James'Gurtowski' Schatz'Lab'

C3BI. VARIANTS CALLING November Pierre Lechat Stéphane Descorps-Declère

Sequence assembly. Jose Blanca COMAV institute bioinf.comav.upv.es

The Resurgence of Reference Quality Genome Sequence

Contact us for more information and a quotation

De novo whole genome assembly

CSE182-L16. LW statistics/assembly

Introduction to Short Read Alignment. UCD Genome Center Bioinformatics Core Tuesday 14 June 2016

De novo whole genome assembly

De novo assembly of complex genomes using single molecule sequencing

NGS technologies approaches, applications and challenges!

De novo human genome assemblies reveal spectrum of alternative haplotypes in diverse

The genome of Leishmania panamensis: insights into genomics of the L. (Viannia) subgenus.

Genome Assembly. J Fass UCD Genome Center Bioinformatics Core Friday September, 2015

De novo Genome Assembly

A near perfect de novo assembly of a eukaryotic genome using sequence reads of greater than 10 kilobases generated by the Pacific Biosciences RS II

BIOINFORMATICS ORIGINAL PAPER

Genome sequencing in Senecio squalidus

From Infection to Genbank

Next Generation Sequences & Chloroplast Assembly. 8 June, 2012 Jongsun Park

Analysis of structural variation. Alistair Ward - Boston College

SUPPLEMENTARY INFORMATION

Nature Methods: doi: /nmeth Supplementary Figure 1. Position dependence of methylation parameter shift.

NEXT GENERATION SEQUENCING. Farhat Habib

De Novo Assembly of High-throughput Short Read Sequences

Analysing genomes and transcriptomes using Illumina sequencing

High-Throughput Bioinformatics: Re-sequencing and de novo assembly. Elena Czeizler

Validation of synthetic long reads for use in constructing variant graphs for dairy cattle breeding

Genome Sequencing-- Strategies

Data Basics. Josef K Vogt Slides by: Simon Rasmussen Next Generation Sequencing Analysis

De novo meta-assembly of ultra-deep sequencing data

Supplementary Materials for

user s guide Question 3

Analysis of structural variation. Alistair Ward USTAR Center for Genetic Discovery University of Utah

10/20/2009 Comp 590/Comp Fall

Biol 478/595 Intro to Bioinformatics

SUPPLEMENTARY INFORMATION

Greene 1. Finishing of DEUG The entire genome of Drosophila eugracilis has recently been sequenced using Roche

Finishing of DFIC This project sought to finish DFIC , the terminal 45 kb of the Drosophila

Experimental Design. Sequencing. Data Quality Control. Read mapping. Differential Expression analysis

Whole Genome Profiling provides a robust framework for physical mapping and sequencing in the highly complex and repetitive wheat genome

Direct determination of diploid genome sequences. Supplemental material: contents

Toward a better understanding of plant genomes structure: combining NGS and optical mapping technology to improve the sunflower assembly

Nature Methods: doi: /nmeth Supplementary Figure 1. Construction of a sensitive TetR mediated auxotrophic off-switch.

Next-generation sequencing technologies

CloG: a pipeline for closing gaps in a draft assembly using short reads

Supplement to: The Genomic Sequence of the Chinese Hamster Ovary (CHO)-K1 cell line

A shotgun introduction to sequence assembly (with Velvet) MCB Brem, Eisen and Pachter

Lecture 14: DNA Sequencing

Short Read Alignment to a Reference Genome

AMOS Assembly Validation and Visualization

Sequencing techniques

Chang Xu Mohammad R Nezami Ranjbar Zhong Wu John DiCarlo Yexun Wang

user s guide Question 3

Whole-genome sequencing and variant discovery in C. elegans

Streaming algorithms for real-time analysis of Oxford Nanopore sequencing data

Weitemier et al. Applications in Plant Sciences (9): Data Supplement S1 Page 1

Result Tables The Result Table, which indicates chromosomal positions and annotated gene names, promoter regions and CpG islands, is the best way for

Nature Methods: doi: /nmeth Supplementary Figure 1. Ideograms showing scaffold boundaries and segmental duplication locations.

Data Analysis with CASAVA v1.8 and the MiSeq Reporter

The Basics of Understanding Whole Genome Next Generation Sequence Data

The goal of this project was to prepare the DEUG contig which covers the

Systematic evaluation of spliced alignment programs for RNA- seq data

NB536: Bioinformatics

Structural variation analysis using NGS sequencing

Matthew Tinning Australian Genome Research Facility. July 2012

NUCLEIC ACIDS. DNA (Deoxyribonucleic Acid) and RNA (Ribonucleic Acid): information storage molecules made up of nucleotides.

Exercises: Analysing ChIP-Seq data

Nature Biotechnology: doi: /nbt Supplementary Figure 1. Number and length distributions of the inferred fosmids.

The tomato genome re-seq project

The Genome Analysis Centre. Building Excellence in Genomics and Computa5onal Bioscience

Transcription:

Genome assembly using Nanopore-guided Long and Error-free DNA Reads Mohammed-Amin Madoui 1 *, Stefan Engelen 1 *, Corinne Cruaud 1, Caroline Belser 1, Laurie Bertrand 1, Adriana Alberti 1, Arnaud Lemainque 1, Patrick Wincker 1,2,3, Jean-Marc Aury 1 1 Commissariat à l Energie Atomique (CEA), Institut de Génomique (IG), Genoscope, BP5706, 91057 Evry, France 2 Université d Evry Val d Essonne, UMR 8030, CP5706, 91057 Evry, France. 3 Centre National de Recherche Scientifique (CNRS), UMR 8030, CP5706, 91057 Evry, France *These authors contributed equally to this work Corresponding author Supplementary Figures and Tables

Figure S1. Distribution of Illumina read coverage along a 35-kb MinION read. The black line represents coverage after the first step (BLAT alignment to find seed-reads) and the red line represents coverage after the recruitment step (comparison of seed-reads against the whole set of Illumina reads).

A B C D E F Figure S2. Impact of Illumina coverage and read length on NaS reads. The x-axis represents genome coverage from 10x to 150x. Each curve represents a given read length: 100bp (black), 150bp (red), 200bp (green), 250bp (blue) and 300bp (purple). This benchmark was achieved on 2D reads from run2 (library 20Kb and R7 chemistry). A. Number of produced NaS reads. B. Cumulative size (in bp) of NaS reads. C. Average size (in bp) of NaS reads. D. N50 size (in bp) of NaS reads. E. Identity of NaS reads aligned to the reference genome. F. CPU time (in seconds).

A B C D Figure S3. Comparison of MinION (x axis) and NaS (y axis) read length. Red lines represent x=y, and green lines are linear regressions. A. 1D reads from run2 (20-kb library and R7 chemistry). B. 2D reads from run2 (20-kb library and R7 chemistry). C. 1D reads from run4 (20-kb library and R7.3 chemistry). D. 2D reads from run4 (20-kb library and R7.3 chemistry).

Figure S4. Comparison of MinION and NaS read coverage. For each base of the reference genome, we computed the difference between coverage in NaS and MinION reads. A negative value indicates a lower coverage in NaS read, whereas a positive value indicates a higher coverage in NaS read.

Figure S5. Overview of rdna cluster 1 of Acinetobacter baylyi ADP1. The figure shows a capture of the rdna1 genomic region from Acinetobacter baylyi ADP1 (from 18 kb to 24kb). The first track contains rdna cluster 1 (black rectangle). The blue rectangles represent alignments of 2D NaS reads, whereas green rectangles represent alignments of 2D MinION reads. The two plots represent respectively the coverage of Nas 2D (blue) and MinION 2D (green) reads. The 19,726 bp NaS read cited in the main text is surrounded with a red rectangle. As discussed in the main text, we observed a lower NaS read coverage around repetitive region in contrast with the MinION read coverage.

21X 112X 15X Figure S6. Example of a contig-graph generated from a read spanning a repeated region. Contigs encircled in red shading (ctg2, ctg6 and ctg1) are contigs covered by seed-reads (coverage is stated in black). Contigs surrounded by blue shading are foreign-contigs. Using the Floyd-Warshall algorithm, the following path has been extracted from the graph: ctg2 ctg6 ctg1, leading to a 27,685 bp NaS read. Red and black numbers represent respectively the coverage (seed-reads and recruited-reads) and the length of the corresponding contig.

Figure S7. Example of a contig-graph generated from a read spanning an rdna locus. Contigs are green boxes; length and coverage of a given contig are respectively the black number in bracket under contig name and the red number over contig name. This graph structure shows 7 sources (ctg1, ctg12, ctg14, ctg13, ctg15, ctg11 and ctg7) and 7 sinks (ctg19, ctg21, ctg9, ctg20, ctg2, ctg16 and ctg10) representing the seven rdna clusters of Acinetobacter baylyi ADP1. Using the Floyd-Warshall algorithm, the following path has been extracted from the graph: ctg1 ctg24 ctg4 ctg17 ctg3 ctg27 ctg23 ctg2.

Figure S8.Comparison of Celera assembler and Newbler contig length. The figure shows a comparison of synthetic read size produced using CA (x axis) and Newbler (y axis).

Table S1. Summary statistics of MinION sequencing (R7 and R7.3 chemistries) and NaS reads Read set R 7 R 7.3 1D 2D 1D 2D # reads 12,086 1,145 45,825 7,436 # reads (>10Kb) 638 275 2,971 3,591 Cumulative size (Mbp) 32.3 8.3 86.6 77.7 Average size (bp) 2,672 7,292 1,888 10,455 MinION reads aligned using LAST N50 size (bp) 6,988 9,641 12,878 12,432 Max size (bp) 121,776 37,578 123,135 58,704 Aligned reads 3,451 (28.5%) 870 (76.0%) 6,172 (13.5%) 6,270 (84.3%) Mean identity percent 57.30% 68.67% 56.45% 75.05% Max alignment size 35,340 36,595 54,158 58,656 Error-free reads 0 0 0 0 # reads 736 797 3,981 5,761 # reads (>10Kb) 23 197 144 2,848 Cumulative size (Mbp) 2.6 5.9 14.6 59.7 Average size (bp) 3,507 7,396 3,663 10,370 NaS reads aligned using bwa mem N50 size (bp) 4,051 9,546 4,306 13,050 Max size (bp) 16,962 35,576 31,283 59,863 Aligned reads 736 (100%) 797 (100%) 3,981 (100%) 5,761 (100%) Mean identity percent 99.9982% 99.9935% 99.9929% 99.9888% Max alignment size 16,962 35,576 31,283 59,863 Error-free reads 97.1% 97.7% 98.1% 96.0%

Table S2. Comparative summary of MinION and proovread reads. MinION reads proovread # reads 543 543 # reads (>10Kb) 234 234 Cumulative size (Mbp) 5.2 5.2 Average size (bp) 9,615 9,621 N50 size (bp) 11,226 11,273 Max size (bp) 37,578 36,598 Aligned reads 344 (63.35%) 344 (63.35%) Mean identity percent 61.611% 71.5677% Max alignment size 36,623 35,643 Error-free reads 0 0

TableS3. Performance comparison of BLAT, BWA, BWA mem and Bowtie2 programs. Assembly programs BLAT BWA BWA mem Bowtie 2 Version 36 0.7.10 0.7.10 2.2.4 Parameters tilesize=10 stepsize = 5 -l 10 k 0 -k 10 -N 0 L 10 3,477 1D reads CPU time (seconds) 654 39 3,986 1,155 Number of MinION reads with at least one hit 745 0 336 1 543 2D reads CPU time (seconds) 578 26 2,525 1,060 Number of MinION reads with at least one hit 353 4 325 50

Table S4. Performance comparison of Newbler, MIRA and Celera assembler (CA). Final contigs were aligned using BWA mem software. Assembly programs Newbler MIRA CA Parameters -urt mi 98 job = genome,denovo,accurate technology = solexa ovlhashbits = 25 ; cnsminfrags = 1 ; ovlerrorrate = 0.1 ; cnserrorrate = 0.1 ; utgerrorrate = 0.1 ; utggrapherrorrate = 0.1 # reads 353 242 352 # reads (>10Kb) 148 117 135 Cumulative size (Mbp) 3.30 2.7 3.14 Average size (bp) 9,368 11,175 8,912 N50 size (bp) 11,790 11,947 11,473 Max size (bp) 35,585 34,954 35,288 Aligned reads 353 (100%) 242 (100%) 352 (100%) Mean identity percent 99.9998% 99.9891% 99.9982% Error-free reads 349 (98.86%) 93 (38.42%) 325 (92.32%) Coverage of the reference 59.2% 52.0% 57.2% Average CPU time (seconds) 5.6 26 105 Cumulative CPU time (seconds) 566 2,603 10,947

Table S5. Celera assembler parameters used to assemble NaS reads. mersize 17 merthreshold 0 merdistinct 0.9995 mertotal 0.995 doobt 1 unitigger bogart ovlerrorrate 0.1 utggrapherrorrate 0.1 utgmergeerrorrate 0.1 cnserrorrate 0.1 cgwerrorrate 0.1 ovlconcurrency 24 cnsconcurrency 24 ovlthreads 1 ovlhashbits 22 ovlhashblocklength 10000000 ovlrefblocksize 10000 cnsreuseunitigs 1 cnsminfrags 100 cnspartitions 64

Table S6. Celera assembler parameters used to assemble Illumina reads. ovlconcurrency 4 ovlthreads 8 ovlcorrconcurrency 20 ovlcorrbatchsize 1000 ovlhashbits 25 ovlrefblocksize 4000000 ovlhashblocklength 100000000 cnsconcurrency 48 cnsminfrags 100 frgcorrconcurrency 20 frgcorrbatchsize 1000 frgcorrthreads 4 frgcorrongrid 1 merylthreads 20 merylmemory 100000 mbtconcurrency 20 mbtthreads 4 ovlerrorrate 0.02 utggrapherrorrate 0.01 meroverlapperextendconcurrency 12 meroverlapperseedconcurrency 12 meroverlapperthreads 4 utggenomesize 3600000 unitigger bogart