SMRT-assembly Error correction and de novo assembly of complex genomes using single molecule, real-time sequencing

Similar documents
SMRT-assembly Error correction and de novo assembly of complex genomes using single molecule, real-time sequencing

Sequencing Pitfalls. Michael Schatz. July 16, 2013 URP Bioinformatics

Whole Genome Assembly and Alignment Michael Schatz. Nov 5, 2013 SBU Graduate Genetics

Advances in Genome Sequencing & Assembly

Whole Genome Assembly and Alignment Michael Schatz. Oct 23, 2012 CSHL Programming for Biology

Genome Sequencing & Assembly Michael Schatz. April 21, 2013 CSHL Genome Access

Whole Genome Assembly and Alignment Michael Schatz. Nov 20, 2013 CSHL Advanced Sequencing

De novo assembly of complex genomes using single molecule sequencing

Current'Advances'in'Sequencing' Technology' James'Gurtowski' Schatz'Lab'

Assembly and Validation of Large Genomes from Short Reads Michael Schatz. March 16, 2011 Genome Assembly Workshop / Genome 10k

Hybrid Error Correction and De Novo Assembly with Oxford Nanopore

A near perfect de novo assembly of a eukaryotic genome using sequence reads of greater than 10 kilobases generated by the Pacific Biosciences RS II

The Resurgence of Reference Quality Genome Sequence

The Resurgence of Reference Quality Genomes

De Novo Assembly of High-throughput Short Read Sequences

De Novo and Hybrid Assembly

Understanding Accuracy in SMRT Sequencing

Analysis of Structural Variants using 3 rd generation Sequencing

De novo genome assembly with next generation sequencing data!! "

Supplementary Data for Hybrid error correction and de novo assembly of single-molecule sequencing reads

1.1 Post Run QC Analysis

DNA Sequencing and Assembly

AMOS Assembly Validation and Visualization

Reference-quality diploid genomes without de novo assembly

Sequence assembly. Jose Blanca COMAV institute bioinf.comav.upv.es

De novo whole genome assembly

Improving Genome Assemblies without Sequencing

Genomic Technologies. Michael Schatz. Feb 1, 2018 Lecture 2: Applied Comparative Genomics

Transcriptomics analysis with RNA seq: an overview Frederik Coppens

Introduction to metagenome assembly. Bas E. Dutilh Metagenomic Methods for Microbial Ecologists, NIOO September 18 th 2014

Long Read Sequencing Technology

Assemblytics: a web analytics tool for the detection of assembly-based variants Maria Nattestad and Michael C. Schatz

Sars International Centre for Marine Molecular Biology, University of Bergen, Bergen, Norway

Consensus Ensemble Approaches Improve De Novo Transcriptome Assemblies

Revolutionize Genomics with SMRT Sequencing. Single Molecule, Real-Time Technology

BIOINFORMATICS ORIGINAL PAPER

Looking Ahead: Improving Workflows for SMRT Sequencing

Single cell & single molecule analysis of cancer

100 GENOMES IN 100 DAYS: THE STRUCTURAL VARIANT LANDSCAPE OF TOMATO GENOMES

High Throughput Sequencing Technologies. UCD Genome Center Bioinformatics Core Monday 15 June 2015

Supplementary Methods

100 Genomes in 100 Days: The Structural Variant Landscape of Tomato Genomes

High Throughput Sequencing Technologies. J Fass UCD Genome Center Bioinformatics Core Tuesday December 16, 2014

TruSPAdes: analysis of variations using TruSeq Synthetic Long Reads (TSLR)

Hybrid assembly of non-model lizards

Genome Assembly. J Fass UCD Genome Center Bioinformatics Core Friday September, 2015

Transcriptome analysis

De novo whole genome assembly

Variation detection based on second generation sequencing data. Xin LIU Department of Science and Technology, BGI

Mate-pair library data improves genome assembly

CloG: a pipeline for closing gaps in a draft assembly using short reads

Next Generation Sequencing. Jeroen Van Houdt - Leuven 13/10/2017

Sequence Assembly and Alignment. Jim Noonan Department of Genetics

Sequencing technologies

High Throughput Sequencing Technologies. J Fass UCD Genome Center Bioinformatics Core Monday September 15, 2014

Next Gen Sequencing. Expansion of sequencing technology. Contents

Improving PacBio Long Read Accuracy by Short Read Alignment

NGS technologies approaches, applications and challenges!

High Throughput Sequencing Technologies. J Fass UCD Genome Center Bioinformatics Core Monday June 16, 2014

First steps towards automated metagenomic assembly

Heterozygosity, Phased Genomes, and Personalized-omics

Experimental Design. Sequencing. Data Quality Control. Read mapping. Differential Expression analysis

Variant Callers. J Fass 24 August 2017

ABSTRACT COMPUTATIONAL METHODS TO IMPROVE GENOME ASSEMBLY AND GENE PREDICTION. David Kelley, Doctor of Philosophy, 2011

Read Quality Assessment & Improvement. UCD Genome Center Bioinformatics Core Tuesday 14 June 2016

ABSTRACT. Genome Assembly:

Nature Biotechnology: doi: /nbt Supplementary Figure 1. Read Complexity

A shotgun introduction to sequence assembly (with Velvet) MCB Brem, Eisen and Pachter

Course summary. Today. PCR Polymerase chain reaction. Obtaining molecular data. Sequencing. DNA sequencing. Genome Projects.

Overview of Next Generation Sequencing technologies. Céline Keime

Next Generation Sequencing. Tobias Österlund

De novo whole genome assembly

Illumina (Solexa) Throughput: 4 Tbp in one run (5 days) Cheapest sequencing technology. Mismatch errors dominate. Cost: ~$1000 per human genme

ChIP-seq and RNA-seq

Validation of synthetic long reads for use in constructing variant graphs for dairy cattle breeding

A Short Sequence Splicing Method for Genome Assembly Using a Three- Dimensional Mixing-Pool of BAC Clones and High-throughput Technology

COPE: An accurate k-mer based pair-end reads connection tool to facilitate genome assembly

Assembly of Ariolimax dolichophallus using SOAPdenovo2

Next Generation Sequencing for Metagenomics

Sequencing technologies. Jose Blanca COMAV institute bioinf.comav.upv.es

Copyright (c) 2008 Daniel Huson. Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation

De novo assembly of human genomes with massively parallel short read sequencing. Mikk Eelmets Journal Club

Aaron Liston, Oregon State University Botany 2012 Intro to Next Generation Sequencing Workshop

RNA-Sequencing analysis

Emerging applications of SMRT Sequencing

Genome Assembly Software for Different Technology Platforms. PacBio Canu Falcon. Illumina Soap Denovo Discovar Platinus MaSuRCA.

Lecture 14: DNA Sequencing

Illumina Read QC. UCD Genome Center Bioinformatics Core Monday 29 August 2016

ChIP-seq and RNA-seq. Farhat Habib

The Iso-Seq Method: Transcriptome Sequencing Using Long Reads

Introductie en Toepassingen van Next-Generation Sequencing in de Klinische Virologie. Sander van Boheemen Medical Microbiology

Deep Sequencing technologies

The Diploid Genome Sequence of an Individual Human

AUDREY FARBOS JEREMIE POSCHMANN PAUL O NEILL KONRAD PASZKIEWICZ KAREN MOORE

Alignment and Assembly

Genome Sequence Assembly

Mapping Next Generation Sequence Reads. Bingbing Yuan Dec. 2, 2010

Biol 478/595 Intro to Bioinformatics

2nd (Next) Generation Sequencing 2/2/2018

Next-generation sequencing Technology Overview

Transcription:

SMRT-assembly Error correction and de novo assembly of complex genomes using single molecule, real-time sequencing Michael Schatz May 10, 2012 Biology of Genomes @mike_schatz / #bog12

Ingredients for a good assembly Coverage Read Length Quality Expected Contig Length 100 1k 10k 100k 1M dog N50 + dog mean + panda N50 + panda mean + 1000 bp 710 bp 250 bp 100 bp 52 bp 30 bp 0 5 10 15 20 25 30 35 40 Read Coverage High coverage is required Oversample the genome to ensure every base is sequenced with long overlaps between reads Biased coverage will also fragment assembly Reads & mates must be longer than the repeats Short reads will have false overlaps forming hairball assembly graphs With long enough reads, assemble entire chromosomes into contigs Errors obscure overlaps Reads are assembled by finding kmers shared in pair of reads High error rate requires very short seeds, increasing complexity and forming assembly hairballs Current challenges in de novo plant genome sequencing and assembly Schatz MC, Witkowski, McCombie, WR (2012) Genome Biology. In Press.

Hybrid Sequencing Illumina Sequencing by Synthesis High throughput (60Gbp/day) High accuracy (~99%) Short reads (~100bp) Pacific Biosciences SMRT Sequencing Lower throughput (600Mbp/day) Lower accuracy (~85%) Long reads (10kbp+)

SMRT Sequencing Imaging of florescent phospholinked labeled nucleotides as they are incorporated by a polymerase anchored to a Zero-Mode Waveguide (ZMW). Intensity Time http://www.pacificbiosciences.com/assets/files/pacbio_technology_backgrounder.pdf

SMRT Sequencing Data Coverage 45 40 35 30 25 20 15 10 5 0 Yeast (12 Mbp genome) 65 SMRT cells 734,151 reads after filtering Mean: 642.3 +/- 587.3 Median: 553 Max: 8,495 1 250 500 750 1000 1250 1500 1750 2000 2250 2500 2750 3000 3250 3500 3750 4000 4250 4500 4750 5000 Min Read Length TTGTAAGCAGTTGAAAACTATGTGTGGATTTAGAATAAAGAACATGAAAG TTGTAAGCAGTTGAAAACTATGTGT-GATTTAG-ATAAAGAACATGGAAG ATTATAAA-CAGTTGATCCATT-AGAAGA-AAACGCAAAAGGCGGCTAGG A-TATAAATCAGTTGATCCATTAAGAA-AGAAACGC-AAAGGC-GCTAGG CAACCTTGAATGTAATCGCACTTGAAGAACAAGATTTTATTCCGCGCCCG C-ACCTTG-ATGT-AT--CACTTGAAGAACAAGATTTTATTCCGCGCCCG TAACGAATCAAGATTCTGAAAACACAT-ATAACAACCTCCAAAA-CACAA T-ACGAATC-AGATTCTGAAAACA-ATGAT----ACCTCCAAAAGCACAA -AGGAGGGGAAAGGGGGGAATATCT-ATAAAAGATTACAAATTAGA-TGA GAGGAGG---AA-----GAATATCTGAT-AAAGATTACAAATT-GAGTGA ACT-AATTCACAATA-AATAACACTTTTA-ACAGAATTGAT-GGAA-GTT ACTAAATTCACAA-ATAATAACACTTTTAGACAAAATTGATGGGAAGGTT TCGGAGAGATCCAAAACAATGGGC-ATCGCCTTTGA-GTTAC-AATCAAA TC-GAGAGATCC-AAACAAT-GGCGATCG-CTTTGACGTTACAAATCAAA ATCCAGTGGAAAATATAATTTATGCAATCCAGGAACTTATTCACAATTAG ATCCAGT-GAAAATATA--TTATGC-ATCCA-GAACTTATTCACAATTAG Sample of 100k reads aligned with BLASR requiring >100bp alignment Average overall accuracy: 83.7%, 11.5% insertions, 3.4% deletions, 1.4% mismatch

Read Quality Empirical Error Rate 0.05 0.10 0.15 0.20 0.25 0 1000 2000 3000 4000 Read Position Consistent quality across the entire read Uniform error rate, no apparent biases for GC/motifs Sampling artifacts at beginning and ends of alignments

Consensus Accuracy and Coverage cns error rate 0.0 0.1 0.2 0.3 0.4 observed consensus error rate expected consensus error rate (e=.20) expected consensus error rate (e=.16) expected consensus error rate (e=.10) 0 5 10 15 20 25 coverage Coverage can overcome random errors Dashed: error model from binomial sampling; solid: observed accuracy For same reason, CCS is extremely accurate when using 5+ subreads CNS Error= c, i= () c/2*+ # " c i $ & e % ( ) i ( 1' e) n'i

SMRT-hybrid Assembly Co-assemble long reads and short reads Long reads (orange) natively span repeats (red) Guards against mis-assemblies in draft assembly Use all available data at once Challenges Long reads have too high of an error rate to assemble directly Assembler must supports a wide mix of read lengths

PacBio Error Correction http://wgs-assembler.sf.net 1. Correction Pipeline 1. Map short reads (SR) to long reads (LR) 2. Trim LRs at coverage gaps 3. Compute consensus for each LR 2. Error corrected reads can be easily assembled, aligned Hybrid error correction and de novo assembly of single-molecule sequencing reads. Koren, S, Schatz, MC, Walenz, BP, Martin, J, Howard, J, Ganapathy, G, Wang, Z, Rasko, DA, McCombie, WR, Jarvis, ED, Phillippy, AM. (2012) Under Review

Error Correction Results Correction results of 20x PacBio coverage of E. coli K12 corrected using 50x Illumina

1. Pre-overlap Consistency checks 2. Trimming Quality trimming & partial overlaps 3. Compute Overlaps Find high quality overlaps 4. Error Correction Evaluate difference in context of overlapping reads 5. Unitigging Merge consistent reads 6. Scaffolding Bundle mates, Order & Orient 7. Finalize Data Build final consensus sequences Celera Assembler http://wgs-assembler.sf.net

SMRT-Assembly Results Hybrid assembly results using error corrected PacBio reads Meets or beats Illumina-only or 454-only assembly in every case

PacBio Long Read Advantages c (a) Long reads close sequencing gaps (b) Long reads assemble across long repeats a b (c) Long reads span complex tandem repeats

Transcript Alignment Long-read single-molecule sequencing has potential to directly sequence full length transcripts 11% mapped by BLAT pre-correction, 99% after correction Directly observe alternative splicing events New collaboration with Gingeras Lab looking at splicing in human

Single Molecule Sequencing Summary PacBio RS has capabilities not found in any other technology Substantially longer reads -> span repeats Unbiased sequence coverage -> close sequencing gaps Single molecule sequencing -> haplotype phasing, alternative splicing Long reads enables highest quality de novo assembly Longer reads have more information than shorter reads Because the errors are random we can compensate for them One chromosome, one contig achieved in microbes Exciting developments on the horizon Longer reads, higher throughput PacBio Nanopore Sequencing

Acknowledgements Schatzlab Giuseppe Narzisi CSHL Dick McCombie NBACC Adam Phillippy Univ. of Maryland Mihai Pop James Gurtowski Melissa Kramer Sergey Koren Art Delcher Mitch Bekritsky Eric Antonio Jimmy Lin Hayan Lee Mike Wigler JHU David Kelley Robert Aboukhalil Zach Lippman Steven Salzberg Dan Sommer Charles Underwood Doreen Ware Ben Langmead Cole Trapnell Ivan Iossifov Jeff Leek

Thank You Poster 209: Giuseppe Narzisi et al. Detection and validation of de novo mutations in exomecapture data using micro-assembly Want to push the limits of biotechnology and bioinformatics? http://schatzlab.cshl.edu/apply/