Looking Ahead: Improving Workflows for SMRT Sequencing


1 Looking Ahead: Improving Workflows for SMRT Sequencing Jonas Korlach FIND MEANING IN COMPLEXITY Pacific Biosciences, the Pacific Biosciences logo, PacBio, SMRT, and SMRTbell are trademarks of Pacific Biosciences in the United States and/or other countries. Covaris is a trademark of Covaris, Inc.; g-tube is a trademark of Bio Plas, Inc.; Caliper and Sciclone are trademarks of Caliper Life Sciences, Inc.; Agilent is a trademark of Agilent Technologies, Inc.; 454 is a trademark of Roche Diagnostics; and Illumina and Moleculo are trademarks of Illumina, Inc. Copyright 2013 by Pacific Biosciences of California, Inc. All rights reserved.

2 A Year Ago

3 Today

4 Requirements for Achieving High-Quality, Finished Genomes
1. High consensus accuracy: >99.999% (QV50); lack of systematic bias
2. Lack of sequence context bias: GC content; low-complexity sequence
3. Long sequence reads: resolve repeats, plasmids; full-length cDNA sequencing; long-range haplotype phasing
4. Base modification detection: epigenome characterization
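The first requirement pairs an accuracy of >99.999% with QV50. Under the standard Phred convention, a quality value is -10 times the base-10 logarithm of the error probability, so the two figures describe the same threshold. The following is a minimal sketch of that conversion; the function names are illustrative and are not taken from any PacBio tool.

```python
import math


def accuracy_to_qv(accuracy: float) -> float:
    """Convert a consensus accuracy (fraction between 0 and 1) to a Phred-scaled QV."""
    error = 1.0 - accuracy
    if error <= 0.0:
        raise ValueError("accuracy must be below 1.0 to give a finite QV")
    return -10.0 * math.log10(error)


def qv_to_accuracy(qv: float) -> float:
    """Convert a Phred-scaled QV back to an accuracy fraction."""
    return 1.0 - 10.0 ** (-qv / 10.0)


if __name__ == "__main__":
    # 99.999% accuracy is one expected error per 100,000 bases.
    for acc in (0.99, 0.999, 0.99999):
        print(f"accuracy {acc} -> QV {accuracy_to_qv(acc):.1f}")
```

Read this way, QV50 on a 5 Mb bacterial genome corresponds to roughly 50 expected consensus errors.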

5 Finished Genomes to Fight Foodborne Outbreaks
~76 million illnesses each year
~325,000 hospitalizations
$78 billion economic loss (US)
High serotype diversity
Emerging hypervirulence

6 National Collection of Type Cultures (NCTC) Collaboration with Public Health England & the Wellcome Trust Sanger Institute Plan to finish 3000 bacterial and 500 viral genomes

7 Joint Genome Institute Production Pipeline

8 Joint Genome Institute Production Workflow
Bacterial Sample -> Extract DNA -> Shear to 10 kb w/ Covaris g-tube devices -> Automated Library Prep on Caliper Sciclone Workstation -> SMRT Sequencing -> Automated Data Analysis -> Publication-Quality Finished Genomes

9 Automated Library Preparation
Bravo platform (Agilent): …Systems/Automation-Solutions/Bravo-Automated-Liquid-Handling-Platform/Pages/default.aspx
Sciclone platform (Caliper): …%20NGS%20Workstation

10 Automated Library Preparation

11 Expanding the Scale [Figure axes: number of samples, sample complexity]

12 Sequencing Full-Length 16S rRNA [Figure: 16S rRNA length covered; accuracy]
Collaboration with Chunlab, DNA Link, and Molecular Diagnostics Korea
Presented at ASM Annual Meeting, Denver, May 2013

13 Measures of Diversity [Figure panels: Water, Soil]
Collaboration with Chunlab, DNA Link, and Molecular Diagnostics Korea

14 Measures of Diversity [Figure panels: Water, Soil]
Collaboration with Chunlab, DNA Link, and Molecular Diagnostics Korea

15 Expanding the Scale [Figure axes: number of samples, sample complexity, genome size]

16 Yeast De Novo HGAP Assembly
[Figure: contigs aligned to chromosomes I-XVI and M; scale bar 100 kb]
Reference (S288C): 17 contigs; genome size = 12.3 Mb
De novo assembly: 30 contigs; assembly size = 12.3 Mb; N50 = 770 kb; max contig = 1.5 Mb (chr. IV)
Data at …
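The assembly slides report contig N50 as a headline metric. N50 is the length of the contig at which, walking from the longest contig downward, the running total first reaches half of the total assembly size. The sketch below shows one common way to compute it; the contig lengths in the example are hypothetical and are not the yeast assembly's actual contigs.

```python
def n50(contig_lengths):
    """Return the N50 of a collection of contig lengths.

    Contigs are visited from longest to shortest; the N50 is the length of
    the contig at which the cumulative sum first reaches half of the total.
    """
    lengths = sorted(contig_lengths, reverse=True)
    if not lengths:
        raise ValueError("at least one contig length is required")
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length


if __name__ == "__main__":
    # Hypothetical contig lengths in kb, for illustration only.
    example = [80, 70, 50, 40, 30, 20]
    print(n50(example))  # 80 + 70 = 150 >= 145, so this prints 70
```

Because N50 depends only on the length distribution, it says nothing about correctness; it is usually read alongside contig count and total assembly size, as on the slide.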

17 Malaria De Novo Assembly
Plasmodium falciparum: … million infections per year; 1 million deaths per year; 20% average GC content; 23.3 Mb genome
Technologies compared: 454 pyrosequencing*, Sanger sequencing*, Illumina sequencing*, SMRT sequencing (30 SMRT Cells)
Samples: progeny (7C126, SC05); parents (Dd2, HB3); reference genome (NP-3D7-S, NP-3D7-L, 3D7)
Number of contigs: 9,452; 9,597; 4,511; 2,971; 26,920; 22,…
N50 contig size (kb): … ,242
Largest contig (kb): … ,534
Number of assembled bases (Mb): …
Average coverage: …
Sample provided by the Broad Institute & Sarah Volkmann (Harvard School of Public Health)
*Samarakoon et al. (2011) BMC Genomics 12: 116.

18 Arabidopsis De Novo Assembly
Original Col-0 assembly (Sanger): ~$70M, several years
Sequenced & assembled Ler-0 strain:

Metric                   PacBio assembly   Short-read assembly (2011)*   Improvement
Assembly size (bp)       124,572,…         …,357,164                     12%
# contigs                540               4,…                           …x
Contig N50 (bp)          6,190,353         66,600                        90x
Max contig length (bp)   12,982,…          …,490                         30x

data at *
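The "Improvement" figures on this slide are fold changes between the two assemblies. As a small worked check, the sketch below recomputes the contig N50 ratio from the two values that survive intact in the transcription (6,190,353 bp and 66,600 bp); the helper name is illustrative, and the other rows are left out because their digits are truncated.

```python
def fold_change(new_value: float, old_value: float) -> float:
    """Return how many times larger new_value is than old_value."""
    if old_value <= 0:
        raise ValueError("old_value must be positive")
    return new_value / old_value


if __name__ == "__main__":
    pacbio_n50_bp = 6_190_353    # contig N50 of the PacBio assembly, from the slide
    short_read_n50_bp = 66_600   # contig N50 of the 2011 short-read assembly, from the slide
    ratio = fold_change(pacbio_n50_bp, short_read_n50_bp)
    print(f"Contig N50 improvement: {ratio:.1f}x")
```

The ratio comes out at about 93x, which the slide reports rounded to 90x.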

19 New Algorithm Developments

20 Hybrid Preassembly on HC-2ex
OverlapInCore Dominates PacBioToCA Run Time for Large Genomes
[Chart: run time (days), 16C x86 vs. HC-2ex]
HC-2ex: 2@8 core Intel X GHz, 48GB DDR3, stripe 600GB SATA disk (host); 16GB SG (coprocessor)
16C x86: same, host only
Parrot data from original paper by Koren et al.
OverlapInCore speedup: 14.7x
Standard server: 19.8 days; HC-2ex: 1.2 days
Ongoing optimization of demanding steps in hybrid and non-hybrid workflows with Pacific Biosciences

21 The Next Challenge: Assembling Diploid Genomes Build bioinformatics and visualization tools for building new algorithms that can resolve diploid genomes Early assembly result for the Ler-0 + Col-0 synthetic diploid.

22 Rice Genome Assembly (Oryza sativa pv Nipponbare: 400 Mb)
Contig N50 by data set: HiSeq Fragments 50x 180 3,925; MiSeq Fragments 23x 459bp 8x 450; Illumina Mates 50x x x 4800; PBeCR + Illumina reads 7x 3500bp (** MiSeq reads for correction); PBeCR + Illumina reads 19x (** MiSeq reads for correction); 6,332 18,248 50, kb
"With the RS, the contigs from our de novo assembly of the 400 Mbp rice genome are several-fold better than the state-of-the-art ALLPATHS-LG assembly using short reads."
Michael C. Schatz, Ph.D., Assistant Professor of Quantitative Biology, Cold Spring Harbor Laboratory (M. Schatz, AGBT talk)

23 Applications Across All Genome Sizes

24 FIND MEANING IN COMPLEXITY Copyright 2012 by Pacific Biosciences of California, Inc. All rights reserved.