Comprehensive Views of Genetic Diversity with Single Molecule, Real-Time (SMRT) Sequencing

Transcription:

Comprehensive Views of Genetic Diversity with Single Molecule, Real-Time (SMRT) Sequencing
Alix Kieu Cruse
November 2015
For Research Use Only. Not for use in diagnostics procedures. Copyright 2015 by Pacific Biosciences of California, Inc. All rights reserved.

AGENDA
- SMRT Technology Overview
- High-quality reference genomes
- Microbiology and Infectious Diseases
- Animal and Plant Genomes
- Structural Variation detection
- Transcriptomic approaches: Iso-Seq
- Epigenomics: base modification detection

SINGLE MOLECULE, REAL-TIME (SMRT) DNA SEQUENCING
- Zero-Mode Waveguides (ZMWs)
- Phospholinked Nucleotides
- Up to 1 million ZMWs per SMRT Cell
SMRT Cells containing up to a million ZMWs are processed on PacBio Systems, which simultaneously monitor each of the waveguides in real time.

KEY SEQUENCING CHARACTERISTICS
1. Contiguity
   - Sequence reads >10,000 bases
   - Some reads >50 kb
2. Accuracy
   - Achieves >99.999% (QV50)
   - Lack of systematic sequencing errors
3. Uniformity
   - Lack of GC content or sequence complexity bias
4. Originality
   - No DNA amplification
   - Epigenome characterization
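The slide pairs ">99.999%" with "QV50". That pairing follows from the standard Phred scale, where QV = -10 * log10(error probability). The sketch below is illustrative only; the function names are hypothetical and are not part of any PacBio tool.

```python
import math


def qv_to_accuracy(qv: float) -> float:
    """Convert a Phred-scaled quality value (QV) to accuracy, i.e. 1 - P(error)."""
    return 1.0 - 10.0 ** (-qv / 10.0)


def accuracy_to_qv(accuracy: float) -> float:
    """Convert an accuracy in [0, 1) to a Phred-scaled quality value."""
    if not 0.0 <= accuracy < 1.0:
        raise ValueError("accuracy must be in [0, 1)")
    return -10.0 * math.log10(1.0 - accuracy)


if __name__ == "__main__":
    # QV50 corresponds to one expected error per 100,000 bases.
    print(f"QV50 accuracy: {qv_to_accuracy(50):.5%}")
    print(f"99.999% accuracy as QV: {accuracy_to_qv(0.99999):.1f}")
```

Each additional 10 QV units reduces the expected error rate tenfold, so QV40 corresponds to 99.99% and QV50 to 99.999%.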

CONSENSUS ACCURACY PERFORMANCE COMPARISON
[Figure: consensus accuracy (QV) versus coverage for P6-C4 and P5-C3 chemistries]
- Coverage needed for ~QV50 (99.999%) consensus accuracy: P6-C4: 30x ±10; P5-C3: 45x ±10
- Reduced coverage requirements and throughput increases combined result in an overall 3-4x performance improvement
E. coli 20 kb-insert library, resequencing analysis with SMRT Analysis v2.3

PCR-FREE SAMPLE PREP WORKFLOW MEANS LESS BIAS
Sample Preparation: building of the SMRTbell template
- DNA Sample
- Fragment DNA
- Repair Ends
- Ligate Adapters
- Purify DNA
- Binding
Library sizes range from 250 bp to 30 kb

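SAMPLING OF APPLICATION REQUIREMENTS
All use standard SMRTbell prep.
- (application label not captured in the transcript)
  - Input material: 50 ng - 1 ug
  - Sample processing: one 10 kb library prep; 1-2 SMRT Cells / 5 Mb
  - Analysis: SMRT Portal HGAP Analysis
- Iso-Seq
  - Input material: total RNA >> cDNA, 5 ng - 1 ug
  - Sample processing: project-scope specific; whole transcriptome discovery: 2-3 libraries; size selection with BluePippin
  - Analysis: SMRT Portal IsoSeq Analysis
- Long Amplicons (HLA)
  - Input material: 50-500 ng
  - Sample processing: 1 library; capabilities for multiplexing
  - Analysis: SMRT Portal Long Amplicon Analysis; third party: GenDX / Conexio
- (application label not captured in the transcript)
  - Input material: 5-10 ug
  - Sample processing: one 20 kb library prep; size selection with BluePippin
  - Analysis: SMRT Portal HGAP Analysis
- (application label not captured in the transcript)
  - Input material: 20-100 ug
  - Sample processing: one to three 20 kb library preps; size selection with BluePippin
  - Analysis: SMRT Portal FALCON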

SMRT SEQUENCING: BROADLY ACCEPTED AND USED
- 150+ PacBio sequencing sites worldwide
- 900+ publications to date with PacBio data
- 1-2 new publications on average per day

CONTIG N50 ASSEMBLY STATISTICS OF GENOMES ASSEMBLED USING ONLY PACBIO DATA

MICROBIOLOGY AND INFECTIOUS DISEASES APPLICATIONS

WHY GENOME FINISHING?
Clostridium autoethanogenum: industrial microbe for commodity chemicals
- PacBio-only assembly: closed, high-quality genome without manual finishing
- Full metabolic pathway reconstruction
- Complete genome revealed genes important for biofuel production that were missed by short-read technologies (~100 contigs)
Brown et al. (2014) Comparison of single-molecule sequencing and hybrid approaches for finishing the genome of Clostridium autoethanogenum and analysis of CRISPR systems in industrial relevant Clostridia. Biotechnology for Biofuels 7:40

Tracking hospital-associated bacteria with epidemiology and genomic sequencing
Julie Segre, PhD, Senior Investigator, National Human Genome Research Institute
Presented at the 60th NIH Clinical Center Anniversary, November 6th, 2013

Patients and hospital environmental isolates: horizontal gene transfer of carbapenemase resistance-encoding plasmid?
[Figure: timeline, June 2011 to January 2013, of carbapenem-resistant Enterobacteriaceae isolated from numbered patients and from sinks, including the NIH outbreak of K. pneumoniae ST258 and the labels "NIH -> Hospital B" (K. pneumoniae ST258) and "NIH -> Hospital A" (E. cloacae). Legend: Klebsiella pneumoniae ST258, Klebsiella pneumoniae ST34, Enterobacter cloacae, Klebsiella oxytoca, Citrobacter freundii]

Gallery of Plasmids
31 plasmids (27 unique) across the 10 isolates sequenced
[Figure legend: KPC-2, KPC-3]

Full genome sequencing rules out patient-patient transmission
[Figure: Klebsiella pneumoniae isolates from pt #1 and pt #53 compared, with pKpQIL plasmid and KPC-2 / KPC-3 marked; panels labelled "Unrelated" and "Identical"]
Targeted sequencing could have given a misleading result

Horizontal gene transfer from patient to environment
[Figure: Klebsiella pneumoniae, Citrobacter freundii, and Enterobacter cloacae; legend: KPC-2, KPC-3]

PLANT AND ANIMAL GENOMES

CONTIG N50 ASSEMBLY STATISTICS OF GENOMES ASSEMBLED USING ONLY PACBIO DATA

OROPETIUM GENOME FOR DROUGHT STUDY
Robert VanBuren & Todd Mockler, Donald Danforth Plant Science Center, St. Louis
[Figure panels: Fresh; 3 week dehydration; 24 hr. rehydration]
Oropetium: resurrection plant, 250 Mb genome

COMPARISON OF ORO GENOME ASSEMBLIES

Illumina libraries
- 6 small-insert libraries:
  - 180 bp insert, PE 2x100 bp
  - 250 bp insert, PE 2x100 bp
  - 500 bp insert, PE 2x100 bp
  - 500 bp insert, PE 2x250 bp
  - 1 kb insert, PE 2x100 bp
  - 1 kb insert, PE 2x300 bp
- 4 mate-pair libraries:
  - 3 kb insert, PE 2x76 bp
  - 8 kb insert, PE 2x76 bp
  - 9 kb insert, PE 2x76 bp
  - 18 kb insert, PE 2x76 bp

PacBio library
- One 20 kb insert library
- BluePippin size selection @ 15 kb protocol
- P6-C4 chemistry
- Mean read length: 12,872 bp
- N50 read length: 16,485 bp
- 10x genome coverage >20 kb

Illumina assembly statistics
- # Scaffolds: 14,216
- Scaffold N50: 11 kb
- Total size: 158 Mb
- Total # Ns: 1.4 Mb
- Est. genome assembled: 63%

PacBio assembly statistics (HGAP)
- # Polished contigs: 625
- Contig N50: 2.38 Mb
- Max contig length: 7.98 Mb
- Sum of contig lengths: 244.46 Mb
- Repeat content: 47%
- Est. genome assembled: 98.30%
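The scaffold, contig, and read N50 values above all use the same statistic: the length L such that sequences of length L or longer contain at least half of the total bases. A minimal sketch of that calculation follows. The function name and the toy lengths are illustrative assumptions, not data from the Oropetium assemblies.

```python
from typing import Iterable


def n50(lengths: Iterable[int]) -> int:
    """Return the N50 of a collection of sequence lengths.

    Lengths are sorted from longest to shortest and accumulated until the
    running sum reaches at least half of the total; the length at which
    that happens is the N50.
    """
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("at least one length is required")
    half_total = sum(ordered) / 2.0
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    return ordered[-1]  # not reached for positive lengths


if __name__ == "__main__":
    # Toy contig lengths (hypothetical), in base pairs.
    toy_contigs = [80, 70, 50, 40, 30, 20]
    print("N50 of toy contigs:", n50(toy_contigs))
```

Because N50 is weighted by length, a few long contigs dominate it, which is why an assembly of 625 contigs can report a contig N50 in the megabase range while a fragmented assembly of thousands of scaffolds reports one in the kilobase range.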

COMPARATIVE ANALYSIS HIGHLIGHTS SYNTENY AND GAP REGIONS
[Figure: Oropetium genome compared with maize genome; protein-coding regions and gap regions marked]
Gaps in potential regulatory regions for gene expression and splicing highlighted
Robert VanBuren, NSF Postdoctoral Fellow, Donald Danforth Plant Science Center (PAG 2015)

RESOLVING TANDEM REPEATS IN DROSOPHILA Y CHR Krsticevic, F. et al. G3. April 9, 2015, doi: 10.1534/g3.115.017277

ASSEMBLY COMPARISON OF MST77Y GENES Krsticevic, F. et al. G3. April 9, 2015, doi: 10.1534/g3.115.017277

ASSEMBLY IMPROVEMENTS WITH PBJELLY 2
- Improved contiguity
  - Closed up to 89% of gaps
  - Increased contig N50 to 500 kb
- Improved transcript alignments
- Lower-quality draft genomes that start with a low contig N50 before addition of any PacBio data show less improvement
PAG 2014: Kim Worley, Improving Genomes Using Long Reads and PBJelly 2

HUMAN BIOMEDICAL APPLICATIONS

STRUCTURAL VARIATION IS THE PREDOMINANT FORM OF SEQUENCE VARIATION IN THE HUMAN GENOME
Non-SNP DNA variation accounts for 22% of all events identified in the donor; however, these events involve 74% of all variant bases. This suggests an important role for non-SNP genetic alterations in defining the diploid genome structure.
Levy et al. (2007) PLoS Biology 5: e254

PACBIO DEFINES THE NEW GOLD STANDARD IN HUMAN GENOME DE NOVO ASSEMBLY

Year  Technology            Assembler         Sample     Contig N50 (Mb)
2007  ABI 3730              Celera            HuRef      0.11
2009  Illumina GA           SOAPdenovo        BGI YH     0.007
2010  454 GS FLX Titanium   Newbler           KB1        0.006
2010  Illumina GA           ALLPATHS-LG       NA12878    0.024
2013  454 GS, HiSeq, MiSeq  Newbler           RP11_0.7   0.13
2014  HiSeq, BAC clones     Reference-guided  CHM1       0.14
2014  PacBio RS II          FALCON            CHM1       4.38
2015  PacBio RS II          FALCON            CHM13      12.98
2015  PacBio RS II          FALCON            AK1        7.28
2015  PacBio RS II          FALCON            HuRef      10.38
2015  PacBio RS II          FALCON            PC-9*      3.58
2015  PacBio RS II          FALCON            SK-BR-3*   2.56
*cancer cell lines

PacBio CHM1 data: http://www.ncbi.nlm.nih.gov/traces/study/?acc=srp044331
Data sources: HuRef (Venter) (http://www.plosbiology.org/article/info:doi/10.1371/journal.pbio.0050254); BGI YH (http://genome.cshlp.org/content/20/2/265.abstract Table II); KB1 (http://www.nature.com/nature/journal/v463/n7283/full/nature08795.html); NA12878 (http://www.pnas.org/content/early/2010/12/20/1017351108.abstract Table 3); CHM1 Illumina (http://www.ncbi.nlm.nih.gov/assembly/gcf_000306695.2/)

HUMAN GENOME SEQUENCING WITH SMRT TECHNOLOGY
Chaisson et al. (2014) Nature doi:10.1038/nature13907

PACBIO DATA VS. GRCH37 & 1000 GENOMES PROJECT
- Closed 55% of interstitial gaps remaining in the reference genome
- Resolved 26,079 euchromatic structural variants at the base-pair level
- ~22,000 (85%) of these are novel
- 6,796 of the events map within 3,418 genes
Chaisson et al. (2014) Nature doi:10.1038/nature13907

PACBIO DATA VS. PRESENT-DAY ILLUMINA DATA
Notably, less than 1% of these variants are present in newer assemblies of the human genome, including GRCh38 and CHM1.1 (ref. 22) (derived primarily by Illumina sequencing technology).
Chaisson et al. (2014) Nature doi:10.1038/nature13907

GENOME-WIDE STRUCTURAL VARIATION CHARACTERIZATION

GENOME-WIDE STRUCTURAL VARIATION CHARACTERIZATION
- Structural variation survey on a diploid genome
- Integrates different sequencing methods as well as BioNano optical mapping
- Integrated analysis pipeline described, available in the cloud-based DNAnexus environment
Here, we characterize the SV content of a personal genome with Parliament, a publicly available consensus SV-calling infrastructure that merges multiple data types and SV detection methods.

GENOME-WIDE STRUCTURAL VARIATION CHARACTERIZATION
- Detecting structural variation with Illumina is difficult, even when integrating different paired-end (PE) data:
Despite these benefits of a multi-algorithm approach, Illumina-only discovery still only recovers approximately half of the 9,777 SVs identified by multi-source Parliament: PBHoney alone identifies 4,268 SVs supported by hybrid assembly, representing events invisible to PE data.

GENOME-WIDE STRUCTURAL VARIATION CHARACTERIZATION
- With only 10x PacBio coverage:
Applying multiple Parliament workflows, we demonstrate that while method integration is optimal for SV detection in Illumina paired-end data, the addition of long-read data can more than triple the number of SVs detectable in a personal genome.

Iso-Seq: Full transcript sequencing

DETERMINATION OF TRANSCRIPT ISOFORMS
[Figure: a gene and its mRNA isoforms, as seen by two approaches]
- Short-read technologies (RNA-Seq): reads spanning splice junctions; insufficient connectivity; splice isoform uncertainty
- Full-length Iso-Seq method: full-length cDNA sequence reads; splice isoform certainty; no assembly required

GENE IDENTIFICATION, EVEN IN WELL-CHARACTERIZED HUMAN CELL LINES AND TISSUES, IS LIKELY FAR FROM COMPLETE
- 8,048 RefSeq-annotated, full-length isoforms and 5,459 predicted isoforms
- Over one-third of these are novel isoforms, including 273 RNAs from gene loci that have not previously been identified
Au et al. (2013) Characterization of the human ESC transcriptome by hybrid sequencing. PNAS doi: 10.1073/pnas.1320101110.

NOVEL ISOFORMS IDENTIFIED IN RICE CULTIVARS WITH FULL-LENGTH ISO-SEQ SEQUENCING
Transcript length distribution differences can be observed between sequencing platforms
PAG 2015 Poster: Dario Copetti, et al. Rapid Full-Length Iso-Seq cDNA Sequencing of Rice mRNA to Facilitate Annotation and Identify Splice-Site Variation

ISO-SEQ FOR IMPROVED ISOFORM DIFFERENTIATION
Full-length transcript sequencing detects novel splice-site variants:
- alternative promoter / poly(A)
- retained introns
- skipped exons
- alternative splice sites
In collaboration with Rod Wing, Arizona Genomics Institute

DISCOVERING FUNGAL TRANSCRIPTOME DIVERSITY

Fig 2. Long, high-quality, consensus sequences accurately benchmark transcript diversity.
Gordon SP, Tseng E, Salamov A, Zhang J, Meng X, et al. (2015) Widespread Polycistronic Transcripts in Fungi Revealed by Single-Molecule mRNA Sequencing. PLoS ONE 10(7): e0132628. doi:10.1371/journal.pone.0132628
http://journals.plos.org/plosone/article?id=info:doi/10.1371/journal.pone.0132628

Epigenetics and base modifications

CHARACTERIZE EPIGENOMES: UNIQUE TO SMRT SEQUENCING
Read the paper @ blog.pacificbiosciences.com

http://dx.doi.org/10.1111/j.1574-6976.2005.00006.x

HAEMOPHILUS INFLUENZAE EPIGENETICS

EPIGENETIC SWITCH REGULATES ADAPTIVE MECHANISMS

DNA METHYLATION IN C. ELEGANS

GENOME DISTRIBUTION OF N6-ADENINE IN C. ELEGANS
Greer et al. Cell 2015.
Important discoveries:
- Identification of N6mA in C. elegans
- Examination of N6mA distribution
- Investigation into the role of N6mA methylase and demethylase in epigenetic inheritance

Introducing the Sequel System

SEQUEL SYSTEM: THE SCALABLE PLATFORM FOR SMRT SEQUENCING
- Based on proven SMRT Technology
- Increased capacity with 1 million ZMWs per SMRT Cell
- Scalable throughput
- Decreased footprint and weight

CONCLUSIONS
- Value of SMRT Sequencing
  - Longest available read lengths, >10 kb
  - High consensus accuracy
  - Uniform, unbiased coverage
- Characterize dynamic microbial genomes, epigenomes, and communities
- Generate high-quality references
- Reveal the complexity of the transcriptome
- Accelerate new discoveries in epigenetics

akieu@pacb.com
Pacific Biosciences, the Pacific Biosciences logo, PacBio, SMRT, SMRTbell, Iso-Seq, and Sequel are trademarks of Pacific Biosciences. BluePippin and SageELF are trademarks of Sage Science. NGS-go and NGSengine are trademarks of GenDx. All other trademarks are the sole property of their respective owners.