
A shotgun introduction to sequence assembly (with Velvet) MCB 247 - Brem, Eisen and Pachter

Hot off the press

January 27, 2009 06:00 AM Eastern Time

Illumina Launches Suite of Next-Generation Sequencing Kits

New Kits Dramatically Increase Throughput and Bring Powerful Sequencing Applications Within Reach of Every Customer

SAN DIEGO--(BUSINESS WIRE)--Illumina (NASDAQ:ILMN) today announced the release of new sequencing chemistry kits and complementary software for its Genome Analyzer system. These new kits and software enable researchers to generate 40% more reads per run and extend read length to greater than 75 base pairs (bp). Also launched is the new Mate Pair Library Preparation Kit, which provides support for generating longer insert paired-end libraries and is complementary to Illumina's existing short-end paired libraries. These new improvements enable researchers to generate 10 to 15 Gigabases (Gb) of high-quality data per run, more than doubling the output previously attainable on the Genome Analyzer.

"The availability of mate pair library kits and long paired-end reads has greatly increased the flexibility and capacity of our Illumina sequencers. I believe that they have greatly improved our ability to sequence cDNA libraries and may even open up the possibility to do de novo sequencing on the Illumina sequencer," said W. Richard McCombie, Ph.D., Professor at the Cold Spring Harbor Laboratory. "They are also greatly helping our medical resequencing by giving us more data and the ability to look for small insertions and deletions in patient samples."

Illumina's unique combination of very high density and long reads allows researchers to economically take on a broad range of projects, such as whole human genome sequencing and de novo sequencing of complex organisms. In addition to the higher output and longer reads afforded by the new kits and software, Illumina's flexible mate pair technique allows researchers to generate paired-end insert libraries measuring two to five kilobases (kb) to more comprehensively catalogue large structural variations. Coupled with Illumina's standard paired-end insert libraries (200-500 bp), which are necessary for detection of smaller structural variants, these kits provide researchers with the most comprehensive set of library preparation tools for accurate and comprehensive sequencing and characterization of complex genomes.

"In addition to providing new solutions for de novo sequencing, the combination of short insert paired-end reads with the new longer insert mate pair sequencing is the most powerful approach for maximal coverage across the genome. This combination enables detection of the widest range of structural variant types and is essential for accurately identifying complex rearrangements," said David Bentley, Vice-President and Chief Scientist of DNA Sequencing at Illumina.

Under an early access program, researchers at the National Center for Genome Resources (NCGR) have started working with the new long read and Mate Pair Library Kits. "At NCGR, the long read and mate pair chemistries are already enabling our cotton de novo and human resequencing projects. Four of our Genome Analyzers are now dedicated to 2 x 88 and 2 x 106 base pair runs, generating up to 20.5 Gigabases per run and a raw accuracy of greater than 99% over 106 base pairs. Additionally, we're excited to use these improvements for structural variant detection and metagenomics," said Greg May, Ph.D., Director of the Genome Center at NCGR.

Assembly basics
(Paired) read length
Insert size
Coverage
Contigs
Scaffolds
Lander-Waterman model/equation/statistics
N50 metric

The chicken (puzzle) and egg (assembly) The chicken is the sequenced part of the genome (you don't know what this is, but it's definitely incomplete). This is the puzzle. The egg is the assembly you produce.

Contigs and Scaffolds

Notation
$L$ = read length
$T$ = minimum detectable overlap
$G$ = genome size
$N$ = number of reads
$c$ = coverage, $c = NL/G$
$\theta = T/L$
$\sigma = 1 - \theta$

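Lander-Waterman
Expected number of islands: $N e^{-c\sigma}$
Expected number of islands consisting of $j$ clones: $N e^{-2c\sigma} \left(1 - e^{-c\sigma}\right)^{j-1}$
Expected number of contigs: $N e^{-c\sigma} - N e^{-2c\sigma}$
Expected length of an island: $L \left( \frac{e^{c\sigma} - 1}{c} + 1 - \sigma \right)$
Expected length of a contig: $\frac{L}{1 - e^{-c\sigma}} \left( \frac{e^{c\sigma} - 1}{c} + 1 - \sigma - e^{-c\sigma} \right)$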
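To make the Lander-Waterman quantities concrete, here is a minimal Python sketch that evaluates them from the notation slide's symbols (N, L, G, T). It treats a contig as an island of at least two reads, so the contig figures are obtained by subtracting the single-read islands from the island totals. The function name and the example numbers are illustrative assumptions, not values from the lecture.

```python
import math


def lander_waterman(N, L, G, T):
    """Evaluate Lander-Waterman expectations.

    N = number of reads, L = read length, G = genome size,
    T = minimum detectable overlap (all in the slide's notation).
    """
    c = N * L / G                # coverage
    theta = T / L                # fraction of a read needed to detect overlap
    sigma = 1 - theta
    x = c * sigma

    islands = N * math.exp(-x)               # expected number of islands
    singletons = N * math.exp(-2 * x)        # islands made of exactly one read
    contigs = islands - singletons           # islands with at least two reads

    island_len = L * ((math.exp(x) - 1) / c + 1 - sigma)
    # Remove the singleton contribution (each singleton has length L)
    # and renormalise over the remaining islands.
    contig_len = L * ((math.exp(x) - 1) / c + 1 - sigma - math.exp(-x)) / (
        1 - math.exp(-x)
    )
    return {
        "coverage": c,
        "islands": islands,
        "singletons": singletons,
        "contigs": contigs,
        "island_length": island_len,
        "contig_length": contig_len,
    }


if __name__ == "__main__":
    # Hypothetical project: 1 Mb genome, 50,000 reads of 100 bp, 20 bp overlap.
    for name, value in lander_waterman(N=50_000, L=100, G=1_000_000, T=20).items():
        print(f"{name}: {value:.2f}")
```

A useful sanity check on the formulas is that the total bases in islands equal the total bases in contigs plus the total bases in single-read islands.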

Quantifying an assembly In addition to recording # contigs, # scaffolds, etc., a popular number is the N50 size: the largest number E such that at least half of the bases are in contigs (scaffolds) of size at least E. Example: If the contigs have sizes 7, 4, 3, 2, 2, 1, 1 (kb), the N50 contig size is 4 kb, since 7 + 4 = 11 kb is at least half of the 20 kb total.
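The N50 definition translates directly into a few lines of Python. This is a minimal sketch; the function name `n50` is an illustrative choice, and it reads the definition as "contigs of size at least E".

```python
def n50(lengths):
    """Return the N50 of a collection of contig (or scaffold) lengths.

    Walk the contigs from longest to shortest and stop at the first one
    whose cumulative length reaches half of the total assembly size.
    """
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("N50 is undefined for an empty assembly")


if __name__ == "__main__":
    # Contig sizes in kb, taken from the slide's example.
    print(n50([7, 4, 3, 2, 2, 1, 1]))
```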

Fragment assembly Computational challenge: assemble individual short fragments (reads) into a single genomic sequence (superstring). Difficult because of: repeats, sequencing errors, sequencing bias, strand ambiguity, lack of unique solution, size of problem.

Computational complexity Problem: Given a set of strings, find a shortest string that contains all of them. Input: Strings $s_1, s_2, \ldots, s_n$. Desired output: A string $s$ that contains all strings $s_1, s_2, \ldots, s_n$ as substrings, such that the length of $s$ is minimized. This is a hard problem: the shortest common superstring problem is NP-hard.

Example Set of strings: 000, 001, 010, 011, 100, 101, 110, 111
A superstring (simple concatenation, length 24): 000001010011100101110111
Shortest superstring (length 10): 0001110100
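Because finding a shortest superstring is hard in general, a common heuristic is to merge the pair of strings with the largest overlap repeatedly. The sketch below illustrates that greedy idea on the binary example. It is not guaranteed to return a shortest superstring, and the function names are illustrative assumptions.

```python
def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_superstring(strings):
    """Greedy heuristic: repeatedly merge the pair with the largest overlap."""
    pool = list(dict.fromkeys(strings))  # drop exact duplicates, keep order
    pool = [s for s in pool if not any(s != t and s in t for t in pool)]
    while len(pool) > 1:
        best_k, best_i, best_j = -1, None, None
        for i, a in enumerate(pool):
            for j, b in enumerate(pool):
                if i != j:
                    k = overlap(a, b)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        merged = pool[best_i] + pool[best_j][best_k:]
        rest = [s for idx, s in enumerate(pool) if idx not in (best_i, best_j)]
        # Strings already contained in the merged string need no further work.
        pool = [s for s in rest if s not in merged] + [merged]
    return pool[0] if pool else ""


def is_superstring(candidate, strings):
    """True if every input string occurs in candidate as a substring."""
    return all(s in candidate for s in strings)


if __name__ == "__main__":
    strings = ["000", "001", "010", "011", "100", "101", "110", "111"]
    result = greedy_superstring(strings)
    print(result, len(result), is_superstring(result, strings))
```

Any superstring of these eight distinct 3-mers needs at least 10 characters, so the greedy answer can be compared against that bound.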

Representing assemblies with de Bruijn graphs
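As an illustration of the idea, the eight binary 3-mers from the superstring example can be placed in a de Bruijn graph whose nodes are 2-mers and whose edges are the 3-mers. A walk that uses every edge exactly once (an Eulerian walk) then spells a superstring of length 10. The sketch below assumes the input k-mers form a connected, balanced graph, as they do here. The function names are illustrative.

```python
def de_bruijn(kmers):
    """Map each (k-1)-mer prefix to the list of (k-1)-mer suffixes it leads to."""
    graph = {}
    for kmer in kmers:
        graph.setdefault(kmer[:-1], []).append(kmer[1:])
        graph.setdefault(kmer[1:], [])
    return graph


def eulerian_walk(graph, start):
    """Hierholzer's algorithm: visit every edge exactly once.

    Assumes such a walk starting at `start` exists.
    """
    remaining = {node: list(nbrs) for node, nbrs in graph.items()}
    stack, walk = [start], []
    while stack:
        node = stack[-1]
        if remaining[node]:
            stack.append(remaining[node].pop())
        else:
            walk.append(stack.pop())
    return walk[::-1]


def spell(walk):
    """Concatenate a walk of overlapping (k-1)-mers into one sequence."""
    return walk[0] + "".join(node[-1] for node in walk[1:])


if __name__ == "__main__":
    kmers = ["000", "001", "010", "011", "100", "101", "110", "111"]
    graph = de_bruijn(kmers)
    sequence = spell(eulerian_walk(graph, "00"))
    print(sequence, len(sequence))
```

The design point is that the graph has one edge per k-mer, so its size depends on the number of distinct k-mers rather than on the number of reads.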

Velvet Overview Step 1: Construct the de Bruijn graph from the reads. Step 2: Simplification. Step 3: Error removal. Step 4: Resolution of repeats
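Steps 1 and 2 can be sketched in a few lines of Python. This is a toy illustration under simplifying assumptions, not Velvet's implementation: nodes are plain k-mers, reverse complements and coverage are ignored, and isolated cycles are not reported. The function names and the example reads are hypothetical.

```python
def build_graph(reads, k):
    """Step 1: link every k-mer to the k-mer that follows it in some read."""
    succ, pred = {}, {}
    for read in reads:
        kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        for kmer in kmers:
            succ.setdefault(kmer, set())
            pred.setdefault(kmer, set())
        for a, b in zip(kmers, kmers[1:]):
            succ[a].add(b)
            pred[b].add(a)
    return succ, pred


def simplify(succ, pred):
    """Step 2: merge each maximal unbranched chain of nodes into one sequence."""
    contigs = []
    for node in succ:
        parents = pred[node]
        if len(parents) == 1 and len(succ[next(iter(parents))]) == 1:
            continue  # node sits inside a chain that starts elsewhere
        sequence, current = node, node
        while len(succ[current]) == 1:
            following = next(iter(succ[current]))
            if len(pred[following]) != 1:
                break  # several paths merge here, so the chain ends
            sequence += following[-1]
            current = following
        contigs.append(sequence)
    return contigs


if __name__ == "__main__":
    # Hypothetical error-free reads sampled from ATGCGTACCTGAAGTC.
    reads = ["ATGCGTACCT", "GTACCTGAAG", "CCTGAAGTC"]
    succ, pred = build_graph(reads, k=4)
    print(simplify(succ, pred))
```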

Removing tips A tip is a chain of nodes that is disconnected on one end. Tips arise from sequencing errors and coverage gaps. Short tips (shorter than 2k bp, where k is the k-mer length) are clipped.
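A minimal sketch of tip clipping on a toy k-mer graph follows. It applies only the length rule from the slide, measuring a tip of n nodes as n + k - 1 bases, which is an assumption about how length is counted. A production assembler such as Velvet also weighs coverage before removing a tip. All function names and reads are hypothetical.

```python
def build_graph(reads, k):
    """Link every k-mer to the k-mer that follows it in some read."""
    succ, pred = {}, {}
    for read in reads:
        kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        for kmer in kmers:
            succ.setdefault(kmer, set())
            pred.setdefault(kmer, set())
        for a, b in zip(kmers, kmers[1:]):
            succ[a].add(b)
            pred[b].add(a)
    return succ, pred


def tip_chain(dead_end, fwd, bwd):
    """Return the unbranched chain ending at dead_end if it hangs off a
    branching node; otherwise return None."""
    chain, node = [dead_end], dead_end
    while True:
        if len(bwd[node]) != 1:
            return None  # free-standing path or merge point, not a simple tip
        parent = next(iter(bwd[node]))
        if len(fwd[parent]) > 1:
            return chain  # parent branches, so the chain is a tip
        chain.append(parent)
        node = parent


def clip_tips(succ, pred, k):
    """Remove tips spelling fewer than 2k bases, in both directions."""
    removed = True
    while removed:
        removed = False
        for fwd, bwd in ((succ, pred), (pred, succ)):
            for dead_end in [n for n in fwd if not fwd[n]]:
                if dead_end not in fwd:
                    continue  # already removed in this pass
                chain = tip_chain(dead_end, fwd, bwd)
                if chain is not None and len(chain) + k - 1 < 2 * k:
                    for node in chain:
                        for m in succ.pop(node):
                            pred[m].discard(node)
                        for m in pred.pop(node):
                            succ[m].discard(node)
                    removed = True
    return succ, pred


if __name__ == "__main__":
    # The second read ends in a simulated sequencing error (A instead of C).
    reads = ["ATGCGTACCTGAAGTC", "GCGTACA"]
    succ, pred = clip_tips(*build_graph(reads, k=4), k=4)
    print(sorted(succ))
```

Note that a genuine sequence end shorter than 2k bases past a branch would be clipped by the same rule, which is one reason coverage is usually taken into account as well.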

Untangling repeats using mate pairs

Comparison of assemblers

References
Lander and Waterman (1988) Genomic mapping by fingerprinting random clones: a mathematical analysis.
Jones and Pevzner (2004) An Introduction to Bioinformatics Algorithms.
Zerbino and Birney (2008) Velvet: Algorithms for de novo short read assembly using de Bruijn graphs.