Mate-pair library data improves genome assembly

Similar documents
Matthew Tinning Australian Genome Research Facility. July 2012

De Novo and Hybrid Assembly

Genome Projects. Part III. Assembly and sequencing of human genomes

Introduction to metagenome assembly. Bas E. Dutilh Metagenomic Methods for Microbial Ecologists, NIOO September 18 th 2014

Genome Sequencing-- Strategies

Biol 478/595 Intro to Bioinformatics

Next Gen Sequencing. Expansion of sequencing technology. Contents

Next-generation sequencing technologies

NGS developments in tomato genome sequencing

Sequence Assembly and Alignment. Jim Noonan Department of Genetics

DNA Sequencing and Assembly

Sequence assembly. Jose Blanca COMAV institute bioinf.comav.upv.es

CSE182-L16. LW statistics/assembly

Lecture 7. Next-generation sequencing technologies

Connect-A-Contig Paper version

02 Agenda Item 03 Agenda Item

NEXT GENERATION SEQUENCING. Farhat Habib

Genomic resources. for non-model systems

Figure S1. Unrearranged locus. Rearranged locus. Concordant read pairs. Region1. Region2. Cluster of discordant read pairs, bundle

The Irys System. Rapid Genome Wide Mapping for de novo Assembly and Structural Variation Analysis. Jack Peart, Ph.D. Director of Sales EMEA

De Novo Assembly of High-throughput Short Read Sequences

10/20/2009 Comp 590/Comp Fall

Aaron Liston, Oregon State University Botany 2012 Intro to Next Generation Sequencing Workshop

A Guide to Consed Michelle Itano, Carolyn Cain, Tien Chusak, Justin Richner, and SCR Elgin.

High Throughput Sequencing Technologies. J Fass UCD Genome Center Bioinformatics Core Monday September 15, 2014

A Crash Course in NGS for GI Pathologists. Sandra O Toole

Alignment and Assembly

Data Basics. Josef K Vogt Slides by: Simon Rasmussen Next Generation Sequencing Analysis

De novo Genome Assembly

Recombinant DNA recombinant DNA DNA cloning gene cloning

Welcome to the NGS webinar series

The Diploid Genome Sequence of an Individual Human

1. A brief overview of sequencing biochemistry

Next Generation Sequencing. Jeroen Van Houdt - Leuven 13/10/2017

Next-generation sequencing Technology Overview

Metagenomic 3C, full length 16S amplicon sequencing on Illumina, and the diabetic skin microbiome

We begin with a high-level overview of sequencing. There are three stages in this process.

1.1 Post Run QC Analysis

Reading Lecture 8: Lecture 9: Lecture 8. DNA Libraries. Definition Types Construction

4. Analysing genes II Isolate mutants*

Greene 1. Finishing of DEUG The entire genome of Drosophila eugracilis has recently been sequenced using Roche

Integrated NGS Sample Preparation Solutions for Limiting Amounts of RNA and DNA. March 2, Steven R. Kain, Ph.D. ABRF 2013

CONSTRUCTION OF GENOMIC LIBRARY

Outline General NGS background and terms 11/14/2016 CONFLICT OF INTEREST. HLA region targeted enrichment. NGS library preparation methodologies

fastest next-gen workflow 10X more throughput fastest-selling sequencer all in six months Ion Personal Genome Machine Sequencer

NB536: Bioinformatics

3) This diagram represents: (Indicate all correct answers)

Lecture 14: DNA Sequencing

Course summary. Today. PCR Polymerase chain reaction. Obtaining molecular data. Sequencing. DNA sequencing. Genome Projects.

The goal of this project was to prepare the DEUG contig which covers the

Restriction Site Mapping:

High Throughput Sequencing Technologies. UCD Genome Center Bioinformatics Core Monday 15 June 2015

Chromosome-scale scaffolding of de novo genome assemblies based on chromatin interactions. Supplementary Material

Molecular Biology: DNA sequencing

Comparison and Evaluation of Cotton SNPs Developed by Transcriptome, Genome Reduction on Restriction Site Conservation and RAD-based Sequencing

Finishing of Fosmid 1042D14. Project 1042D14 is a roughly 40 kb segment of Drosophila ananassae

GENETICS - CLUTCH CH.15 GENOMES AND GENOMICS.

Next generation sequencing techniques" Toma Tebaldi Centre for Integrative Biology University of Trento

Assemblytics: a web analytics tool for the detection of assembly-based variants Maria Nattestad and Michael C. Schatz

Introduction to Plant Genomics and Online Resources. Manish Raizada University of Guelph

Contact us for more information and a quotation

High throughput omics and BIOINFORMATICS

Computational Biology 2. Pawan Dhar BII

Finishing Fosmid DMAC-27a of the Drosophila mojavensis third chromosome

Sept 2. Structure and Organization of Genomes. Today: Genetic and Physical Mapping. Sept 9. Forward and Reverse Genetics. Genetic and Physical Mapping

Introduction to CGE tools

RNA-Sequencing analysis

The Journey of DNA Sequencing. Chromosomes. What is a genome? Genome size. H. Sunny Sun

De novo genome assembly with next generation sequencing data!! "

Overview: The DNA Toolbox

Human Genome Sequencing Over the Decades The capacity to sequence all 3.2 billion bases of the human genome (at 30X coverage) has increased

Deep Sequencing technologies

The Structure of Proteins and DNA

Bi 8 Lecture 4. Ellen Rothenberg 14 January Reading: from Alberts Ch. 8

Introductory Next Gen Workshop

Nature Biotechnology: doi: /nbt Supplementary Figure 1. Read Complexity

Multiple choice questions (numbers in brackets indicate the number of correct answers)

Building with DNA 2. Andrew Tolonen Genoscope et l'université d'évry 08 october atolonen at

RIPTIDE HIGH THROUGHPUT RAPID LIBRARY PREP (HT-RLP)

N50 must die!? Genome assembly workshop, Santa Cruz, 3/15/11

Next Generation Sequencing. Tobias Österlund

A draft sequence of bread wheat chromosome 7B based on individual MTP BAC sequencing using pair end and mate pair libraries.

Result Tables The Result Table, which indicates chromosomal positions and annotated gene names, promoter regions and CpG islands, is the best way for

Genome Analysis. Bacterial genome projects

From Infection to Genbank

Biochemistry 661 Your Name: Nucleic Acids, Module I Prof. Jason Kahn Exam I (100 points total) September 23, 2010

De novo whole genome assembly

Recombinant DNA Technology

CloG: a pipeline for closing gaps in a draft assembly using short reads

The Basics of Understanding Whole Genome Next Generation Sequence Data

Direct determination of diploid genome sequences. Supplemental material: contents

BIOINFORMATICS ORIGINAL PAPER

High Throughput Sequencing Technologies. J Fass UCD Genome Center Bioinformatics Core Tuesday December 16, 2014

Get to Know Your DNA. Every Single Fragment.

Finishing Drosophila ananassae Fosmid 2410F24

Genomic Technologies. Michael Schatz. Feb 1, 2018 Lecture 2: Applied Comparative Genomics

DNA METHYLATION RESEARCH TOOLS

Structural variation analysis using NGS sequencing

Whole-genome sequencing (WGS) of microbes employing nextgeneration sequencing (NGS) technologies enables pathogen

Corynebacterium pseudotuberculosis genome sequencing: Final Report

Transcription:

De Novo Sequencing on the Ion Torrent PGM APPLICATION NOTE Mate-pair library data improves genome assembly Highly accurate PGM data allows for de Novo Sequencing and Assembly For a draft assembly, generate contig assembly of N50 (on fragment data) of >119kb For a higher quality assembly, add Mate-pair data to generate an increase in contigs, N50 of >177kb (~50% increase) with only 22 scaffolds (N50 > 2.3Mb) In order to carry out de novo sequencing large amounts of high quality sequence is required. The Ion Torrent PGM has a simple fast workflow that enables the rapid generation of large volumes of data. Rapid whole-genome sequencing of epidemiologically relevant organisms has been carried (Mellmann et al. (2011) PLoS One 6:e22751; Rhode et al. (2011) NEJM, 365:718-724.), but the final assembly inevitably contains gaps in the sequence. In this application note we show how the fragment library data can be augmented with mate-pair library data to produce a much higher quality assembled sequence (larger contigs, scaffolds, and better N50 lengths). Improved sequence finishing using mate-pair library data As sequencing technologies evolve, researchers are able to take advantage of increased throughput, higher accuracy, and longer reads. Current genome sequencing efforts can roughly be divided into two camps Sanger shotgun sequencing and massively parallel high-throughput sequencing. Sanger sequencing yields very accurate reads up to 1,000 bases, but this methodology is more expensive and time consuming. The Ion Torrent PGM system produces read lengths of up to 200 bases, shorter than those of Sanger sequencing. Many researchers see this as a worthwhile tradeoff because next-gen sequencing produces a very large number of reads in a very short time period, making data gathering with this strategy less expensive and faster. Even if 1,000-base reads were routinely achievable, accurately arranging a multitude of 1,000-base fragments into a genome is still fraught with difficulties expansive sections of repeated sequence and variations in sequence quality in some regions prevent accurate assembly. So at the end of all fragment library sequencing projects, the assembled genome comprises a set of overlapping reads (contigs) interspersed with gaps where the sequence is not reliably identified. For some researchers, fragment library data alone is sufficient. For example, when establishing contigs for many comparative genomics approaches, the extra time, effort, and expense required to fill sequence gaps is not warranted. When the sequencing strategy calls for complete or near-complete sequence, however, many scientists opt for additional data from a different kind of library a mate-pair (MP) library. MP libraries are created by using directed molecular biology steps to capture extreme ends of much longer DNA fragments from hundreds of base pairs each to 50 kbp and package them in short fragments that are suitable for next-gen sequencing (Figure 1). This means that instead of getting positional information only for the stretch of sequence in each fragment library clone, you now get positional information over much larger distances (Figure 1). When a fragment assembly is augmented with the data obtained from a mate-pair library, contigs can be ordered into scaffolds and many of the sequence gaps can be closed, which converts smaller contigs into larger supercontigs and scaffolds (Figure 2) and results in longer N50 values. Adding one MP library to fragment library data generates a good draft assembly (small number of large scaffolds) with fairly accurate annotation. Adding a second MP library allows you to generate a sequence that approaches finished status.

Fragment library gives information over the length of the read only Sequence A Adaptor DNA Fragment P1 Adaptor ~230 bases representing one contiguous genomic piece Mate-pair library gives information over a much larger genomic distance Sequence A Adaptor End 1 ~80 bases Internal adaptor 36 bases End 2 ~80 P1 Adaptor ~200 bases comprising an adaptor flanked by two ~80-base segments taken from the extreme ends of a 3,000-base* genomic fragment * Note that this example assumes 3,000 bp fragments were used to construct the mate-pair library. In practice lengths from a few hundred base pairs to 50 kbp can be used. Figure 1. The mate-pair library delivers sequence position information over a much larger genomic distance than the fragment library.

Fragment library data More gaps, many, shorter contigs, lower N50 lengths Adding mate-pair library data Reduces gaps, produces fewer, longer contigs, higher N50 lengths Reference sequence Figure 2. By augmenting the fragment library data with mate-pair library data, gaps in the assembly can be closed and contigs can be joined into scaffolds. Results In order to demonstrate the benefits of Mate pairs a model organism (E. coli MGG17655) is sequenced and assembled using fragment data and Mate Pairs Mate-pair library construction In this study, genomic DNA is fragmented and size selected on an agarose gel before ligation with matepair adaptors. These DNA fragments are then circularized by hybridization so that mate-pair adaptors form an internal adaptor to connect both ends of DNA together. The DNA fragments containing mate-pair ends in the circularized DNA are released by a unique nick-translation reaction following with exonuclease treatment. The mate-pair DNA fragments are enriched and ligated with fragment library adaptors to form the mate-pair library. The detailed mate-pair library construction procedure is described in the user bulletin, Ion Mate-Paired Library Preparation (http://lifetechit.hosted.jivesoftware.com/docs/doc-1999). For this assembly two separate libraries with inserts 3.5 kb and 8.9 kb were constructed using this protocol. Pairing analysis after acquiring sequencing data After identifying the internal adapter sequence and then splitting the sequencing read into two tags, the individual tags are mapped to the reference genome and the distance between the reads is determined. This distance is the insert length of the library fragment from which these tags are derived. Plotting insert lengths of a 3.5 kb and a 8.9 kb mate-pair library demonstrates a constrained distribution around the expected size (Figure 3). [Note: a description of how to use SFF extract to identify the internal adaptor sequence can be found in the user bulletin, Using MIRA assembly with Ion Torrent PGM reads.docx at http://lifetech-it.hosted.jivesoftware.com/docs/doc-2163.]

2500 2000 8.9 kb insert 3.5 kb insert 1500 Count 1000 500 0 Insert size Figure 3. Insert length distribution of the mate-pair library. Assembling the sequence From the E. coli MG1655 genome, three libraries are constructed and sequenced: one fragment library (mean length = 207 bp), one 3.5 kb insert mate-pair library, and one 8.9 kb insert mate-pair library. After splitting the mate-pair reads into two tags and removing the internal adaptor sequence, the average tag lengths are 80 bp. Libraries are subsampled to yield combined coverage of approximately 40-fold and then assembled using MIRA (see http://lifetech-it.hosted.jivesoftware.com/docs/doc-2163). Because the MIRA assembler does not perform scaffolding, the output of this tool is a set of contigs. Subsequent joining of these contigs into scaffolds is performed using the standalone SSPACE software, which maps the ends of mate-paired reads to the set of contigs and identifies pairs that link adjacent contigs. Assembly statistics are described in Table 1.

Table 1. Improvements in contig and scaffold assemblies when mate-pair library data are used to augment fragment library data. Contig stats Scaffold stats Total consensus Number of contigs Largest contig Contig N50* % reference genome covered Total consensus Number of scaffolds Largest scaffold Scaffold N50* * See glossary for definition of N50. Frag only Equal coverage of fragment reads and mate-pair reads. Frags plus 8.9 kb mates Frags plus 8.9 kb mates plus 3.5 kb mates 4,607,823 4,615,388 4,613,542 83 75 79 357,570 327,207 344,146 119,187 157,428 177,681 99.998% 99.997% 99.997% n/a 4,671,736 4,659,192 n/a 38 22 n/a 3,046,212 2,337,293 n/a 3,046,212 2,337,293 The fragment library alone yields a contig assembly with an N50 of over 119 kb, and reference genome coverage of 99.998%. For certain applications, this level of assembly may be adequate for answering the biological question of interest. For example, about 98% of protein coding genes are expected to be contiguous, so genome content can be largely assessed. For applications that require a more complete draft genome assembly, mate-pair data can be generated and combined with the fragment data. Adding the 8.9 kb insert, 2 x 80 bp mate-pair data to the fragment data after subsampling each of the libraries down to 20-fold average coverage yielded further improvement in contig N50, with over half of the assembly in contigs larger than 157 kb. Scaffolding of these contigs using the 8.9 kb mate-pairs further joins the contigs into 38 scaffolds with an N50 of 3.05 Mb. By adding a second mate-pair library with a 3.5 kb insert size to the assembly, further improvements in scaffolding are realized, with a total of 22 scaffolds and 99.96% of the genome in the four largest scaffolds. Results of the three assemblies is depicted in Figure 4 in which contigs (blue) and scaffolds (green) are mapped to the MG1655 reference genome. Repeat sequences, including rrna loci and transposons are indicated around the outside of the circle. It should be noted that many gaps occur at these natural repeat sequences.

Figure 4. Circular plot of contigs and scaffold coverage. Assemblies are mapped against the E. coli MG1655 reference chromosome using the MUMmer software suite, and alignments are depicted using Circos. Contigs are in shades of blue and scaffolds are in shades of green. The inside circle represents contigs from the fragment library assembly. Moving outward, the next two circles are contigs and scaffolds from the fragment plus 8.9 kb MP assembly. The outer two alignments represent contigs and scaffolds from the assembly of fragment plus both MP libraries. Each contig or scaffold alignment is depicted as a block, with individual contigs differentiated by stagger and shades of color. Repetitive sequences, including mobile elements (red) and rrna loci (green) are indicated around the outside of the circle. Conclusion

Based on the data from this MG1655 sequencing project, it is clear that the protocol described in the report Ion Mate-Paired Library Preparation for creating mate-pair libraries and sequencing them on the Ion PGM System produces significantly improved assemblies, and when the reads are assembled using the MIRA assembler and the resulting data combined with fragment-read data, contig and scaffold assemblies are improved. Using SSPACE software to further join contigs based on mate-pair data results in extensive scaffolds that comprise the vast majority of the genome. The assembly improves significantly 83 contigs are seen with fragment only assemblies while using MP data produces 22 scaffolds and an extra 52kb of consensus sequence. Most of the remaining breaks in the MG1655 assembly are largely due to natural repeat sequences in the genome. Mate-pair vs. paired-end There is some confusion regarding these two terms depending on the source. In this application note, a mate-pair is defined as a sequencing molecule that places the extreme ends of a longer genomic DNA fragment physically close together using molecular biology techniques so that the sequence for both ends of this long DNA fragment is contained in a much shorter fragment. Paired-end sequencing describes a type of sequencing in which a single fragment is sequenced from both ends. Currently, paired-end reads are also enabled for PGM with certain modification on the library and sequencing condition (see Paired End Sequencing application note). N50 N50 is a weighted statistical measure of the median contig length in a set of sequences. Or, to put it another way, the N50 is length L (in base pairs) such that 50% of the bases are in contigs the size of L or greater. Larger N50 values correlate to more complete assemblies. N50 was calculated by sorting all of the lengths of the contigs from largest to smallest and then determining the minimum set of contigs (starting with the largest contig on the list and adding successive contig lengths) until the cumulative sizes total 50% of the assembled genome. For example, for 5,000 bases of assembled sequence, the N50 length is fragment size you arrive at in the list when the cumulative size is at least 2,500 bases. For a microbial genome such as E. coli O104:H4 (assuming an average bacterial gene size of 1 kb), an N50 length of 10 kb would mean that about 90% of the genes in that genome are intact; an N50 of 100 kb would mean that about 99% of the genes in that genome are intact.