The Basics of Understanding Whole Genome Next Generation Sequence Data

Similar documents
The Basics of Understanding Whole Genome Next Generation Sequence Data

Introduction to NGS Analysis Tools

Validating Bionumerics 7.6: A strategic approach from Oregon

Subtyping the top 30 Salmonella serotypes using a combination of CRISPR elements and virulence genes: Salmonella CRISPR-MLVST

Whole Genome Sequence Data Quality Control and Validation

Data Basics. Josef K Vogt Slides by: Simon Rasmussen Next Generation Sequencing Analysis

EURL WORKING GROUP ON WHOLE GENOME SEQUENCING AND PULSENET INTERNATIONAL

Whole Genome Sequencing for food safety FSA Chief Scientific Advisor Report and 2013 Listeria pilot study

Detecting Clusters and Reporting Results

From Bands to Base Pairs: Implementation of WGS in a PulseNet Laboratory

Developing Tools for Rapid and Accurate Post-Sequencing Analysis of Foodborne Pathogens. Mitchell Holland, Noblis

Introduction to PulseNet WGS Tools in BioNumerics v7.6

Canada's IRIDA platform for genomic epidemiology. Gary Van Domselaar Chief, Bioinformatics National Microbiology Lab Public Health Agency of Canada

Introduction to CGE tools

Whole Genome Sequencing for Enteric Pathogen Surveillance and Outbreak Investigations

Sequence assembly. Jose Blanca COMAV institute bioinf.comav.upv.es

Introduction to metagenome assembly. Bas E. Dutilh Metagenomic Methods for Microbial Ecologists, NIOO September 18 th 2014

Updates from CDC: Cluster Detection and Reporting Guidelines

Targeted Sequencing in the NBS Laboratory

Sequence Assembly and Alignment. Jim Noonan Department of Genetics

Beef Industry Safety Summit Renaissance Austin Hotel 9721 Arboretum Blvd. Austin, TX March 1-3

Bringing Whole Genome Sequencing on Board in a State Regulatory Laboratory

Analytics Behind Genomic Testing

De Novo and Hybrid Assembly

VTEC strains typing: from traditional methods to NGS

Introduction to Whole Genome Sequencing and its Applications in Microbial Diagnostics

The Diploid Genome Sequence of an Individual Human

Introduction to Whole Genome Sequencing and its Applications in Microbial Diagnostics

Setting the Course: Virginia's experience navigating information technology and bioinformatics needs for whole genome sequencing

Next generation sequencing in diagnostic laboratories: opportunities and challenges

GENOME ASSEMBLY FINAL PIPELINE AND RESULTS

Overview of CIDT Challenges and Opportunities

De novo Genome Assembly

IFSH WHOLE GENOME SEQUENCING FOR FOOD INDUSTRY SYMPOSIUM May 22-23, 2017

Connect-A-Contig Paper version

Whole-Genome Sequencing (WGS) for Food Safety

Lees J.A., Vehkala M. et al., 2016 In Review

A shotgun introduction to sequence assembly (with Velvet) MCB Brem, Eisen and Pachter

Introduction to Bioinformatics

Current status of universal whole genome sequencing of Mycobacterium tuberculosis in the United States

THE RISE OF WHOLE GENOME SEQUENCING AS A SUBTYPING TOOL FOR MICROBIAL SOURCE TRACKING: FROM FUNDAMENTALS TO APPLICATIONS

ESCMID Online Lecture Library. by author

Introduction to Whole Genome Sequencing and its Applications in Microbial Diagnostics

Mate-pair library data improves genome assembly

Sequencing technologies. Jose Blanca COMAV institute bioinf.comav.upv.es

DNA. bioinformatics. genomics. personalized. variation NGS. trio. custom. assembly gene. tumor-normal. de novo. structural variation indel.

2nd (Next) Generation Sequencing 2/2/2018

Assemblytics: a web analytics tool for the detection of assembly-based variants Maria Nattestad and Michael C. Schatz

Genome Assembly Background and Strategy

Genome Sequencing-- Strategies

Metagenomic 3C, full length 16S amplicon sequencing on Illumina, and the diabetic skin microbiome

Bioinformatics Tools and Pipelines for Real-Time Pathogen Surveillance

Next Gen Sequencing. Expansion of sequencing technology. Contents

Bioinformatics- Data Analysis

TruSPAdes: analysis of variations using TruSeq Synthetic Long Reads (TSLR)

Human Genome Sequencing Over the Decades The capacity to sequence all 3.2 billion bases of the human genome (at 30X coverage) has increased

Computational assembly for prokaryotic sequencing projects

14 March, 2016: Introduction to Genomics

TECHNICAL REPORT. Fifth external quality assessment scheme for Listeria monocytogenes typing.

The implementation and application of Whole Genome Sequencing in the Campylobacter Reference Laboratory at Public Health England Craig Swift

Sequencing technologies. Jose Blanca COMAV institute bioinf.comav.upv.es

Sequencing technologies. Jose Blanca COMAV institute bioinf.comav.upv.es

Fast, Accurate and Sensitive DNA Variant Detection from Sanger Sequencing:

Whole-genome sequencing (WGS) of microbes employing nextgeneration sequencing (NGS) technologies enables pathogen

DNA Sequencing and Assembly

CDC s Advanced Molecular Detection (AMD) Sequence Data Analysis and Management

UHT Sequencing Course Large-scale genotyping. Christian Iseli January 2009

Genome Assembly Using de Bruijn Graphs. Biostatistics 666

PFGE Shortcuts that Save a Penny but Cost a Dollar

Whole genome and core genome multilocus sequence typing and single nucleotide

Antisera QC and IQCP and Associated Challenges

Data Intensive Biomedical Research: The EU RL VTEC efforts to take up the NGS challenge. EU RL for E. coli Annual Workshop 2015

Variant Callers. J Fass 24 August 2017

Using New ThiNGS on Small Things. Shane Byrne

Course summary. Today. PCR Polymerase chain reaction. Obtaining molecular data. Sequencing. DNA sequencing. Genome Projects.

DNA concentration and purity were initially measured by NanoDrop 2000 and verified on Qubit 2.0 Fluorometer.

Introductie en Toepassingen van Next-Generation Sequencing in de Klinische Virologie. Sander van Boheemen Medical Microbiology

by author Bacterial typing - what methodology should I use? MTE Session ECCMID 2017 VIENNA, 25 APRIL 2017 L u í s a V i e i ra P e i xe

Using Galaxy for the analysis of NGS-derived pathogen genomes in clinical microbiology

Matthew Tinning Australian Genome Research Facility. July 2012

Rue Juliette Wytsmanstraat Brussels Belgium T F

SEQUENCE QUALITY CONSIDERATIONS FOR THE WET LAB

2012 PulseNet PFGE Laboratory Update: Everything Old Is New Again

Analysis of structural variation. Alistair Ward USTAR Center for Genetic Discovery University of Utah

From Variants to Pathways: Agilent GeneSpring GX s Variant Analysis Workflow

Gap Filling for a Human MHC Haplotype Sequence

Human Genetic Variation. Ricardo Lebrón Dpto. Genética UGR

2014 APHL Next Generation Sequencing (NGS) Survey

ESCMID Online Lecture Library. by author

Genome Sequence Assembly

DNA polymorphisms and RNA-Seq alternative splicing blow bubbles in de Bruijn Graphs

Next Generation Sequencing Applications in Food Safety and Quality

Next Generation Sequencing. Jeroen Van Houdt - Leuven 13/10/2017

Whole genome sequencing in the reference laboratory: An Introduction & Overview

The Human Genome and its upcoming Dynamics

De novo assembly of human genomes with massively parallel short read sequencing. Mikk Eelmets Journal Club

Variant calling workflow for the Oncomine Comprehensive Assay using Ion Reporter Software v4.4

De novo genome assembly with next generation sequencing data!! "

NEXT GENERATION SEQUENCING. Farhat Habib

De novo whole genome assembly

Transcription:

The Basics of Understanding Whole Genome Next Generation Sequence Data Heather Carleton-Romer, MPH, Ph.D. ASM-CDC Infectious Disease and Public Health Microbiology Postdoctoral Fellow PulseNet USA Next Generation Subtyping Unit NCEZID/DFWED/EDLB 2013 InFORM National Center for Emerging and Zoonotic Infectious Diseases Division of Foodborne, Waterborne and Environmental Diseases

Objectives Provide a basic overview of the terminology surrounding whole genome sequence (WGS) data Explain ways to analyze WGS data to characterize isolates

Next Generation Sequence Data Generation Leading NGS Benchtop Sequencers Sequence output Millions of reads Gigabytes sequencing data per run What do you do with it? Assemble genomes Ion Torrent PGM Whole genome analyses Illumina MiSeq

Raw Read WGS terms for Raw Read Quality Single sequencing output from your NGS machine; length depends on sequencing chemistry Quality scores Likelihood the base call is correct each base read from the machine will have a quality score Used for trimming reads Coverage Average divide the total # of bases by the genome size (i.e. 156,000,000 (total bases from sequencer)/ 3,000,000 (size of genome = 52x coverage) Specific how many reads span the 1 base you are looking at

WGS terms for Assembly Contig Assembly of overlapping reads into a single longer piece of DNA Reference genome Well characterized genome that was assembled into 1 contig per chromosome and is fully circularized Assembly de novo versus reference guided

Life of a NGS Read - Assembly Raw Read Trim base pairs with bad quality scores Trimmed Read Assemble reads into a contig (de novo assembly) trimmed reads Contig

Assessing Assembly Quality Assembly metrics can indicate sequence quality Number of contigs raw reads assembles into Good: E. coli <200, Salmonella < 100, Listeria < 30 N50 statistic is similar to median contig length Good: >200,000 bp Can extract genes and other genetic elements from an assembly for a whole genome multilocus sequence type (wgmlst) Some genes or genome regions may be missing because that part of the genome did not sequence well and no reads covered it

Ways to Analyze WGS data Kmer analysis SNP analyses K-mer SNP hqsnp Computational demands Whole genome multilocus sequence typing (wgmlst)

K-mer : K-mer analysis Raw reads are chopped up into k-mers of length k using computer algorithms k is determined by which length gives you the best specificity and most adequate resolution Identify unique and similar k-mers to determine how related isolates are to each other Identify unique k-mers Compare similar k-mers from different isolates

SNP Analysis Terms Single Nucleotide Polymorphism (SNP) ATGTTCCTC sequence ATGTTGCTC reference Insertion or Deletion (Indel) ATGTTCCCTC sequence ATGTTC-CTC reference

Ways to perform SNP Analysis Reference-based SNP calling High quality SNP (hqsnp) Raw reads are mapped to a highly related reference Chosen based on coverage and read frequency at SNP location Shows the phylogenetic relationship ATGTTACTC ATGTTCCTC ATGTTCCTC ATGTTCCTC ATGTTCCTC Raw Re ad s ATGTTCCTC ATGTTTCTC ATGTTCCTC ATGTTCCTC ATGTTCCTC ATGTTGCTC reference Is it a SNP? Is it a band?

Caveat s t o hqsnp Analysis Advantages: Phylogenetically informative SNP position can be identified on genome to determine what gene or intragenic region contains the SNP Disadvantages: Requires a closed reference or good draft genome Recent closed references from all serotypes are not available Computationally costly Requires multiple sequence alignment to a reference

Ways to perform SNP Analysis K-mer based SNP calling (ksnp) Raw reads are chopped up into k-mers of length k (k can vary from 13-35bp (or longer), must be an odd number) K-mers screened for variation at the center position AAAAAATGTTACTCGGATAAC Isolate 1 AAAAAATGTTGCTCGGATAAC Isolate 2 AAAAAATGTTACTCGGATAAC Isolate 3 K-mers from AAAAAATGTTCCTCGGATAAC Isolate 4 different isolates AAAAAATGTTACTCGGATAAC Isolate 5-21 bp kmer with variant position at bp 11

Caveat s t o k-mer SNP analysis Advantages: Does not require a reference or multiple sequence alignment Relatively fast analysis Does not require assembly Disadvantages: K-mer SNPs from raw reads do not identify the exact location of the SNP on a genome (from raw reads) Does not consider sequence quality Cannot detect SNPs located close to each other

SNP based analyses how do you solve a puzzle Different ways to solve? Puzzle pieces = raw reads Start with the edges then fit pieces = aligning to a reference to find hqsnps Start in the middle with matching color pieces (kmer) = Aligning kmers to find kmer SNPS

Whole Genome MLST Compare gene content between different isolates (can compare over 4000 genes) Differences between isolates based on SNPs or Indels within genes both genomes contain Can look at virulence genes, serotyping genes, and antibiotic resistance determinants between isolates Software like BIGSdb and BioNumerics 7.5 can run these analyses

Concluding remarks WGS Opportunities Universal high resolution subtyping method All information currently obtained by traditional methods contained in the sequence data Can use to identify serotype, virulence genes, resistance genes Challenges Large amounts of data presents storage and analysis issues No standardization for quality metrics for analysis Backwards comparability of WGS data with PFGE difficult to establish Interpretation of data how to define clusters?

Quest ions? For more information please contact Centers for Disease Control and Prevention PulseNet/ CDC 1600 Clifton Road NE, Atlanta, GA 30333 Telephone: 1-404-718-4269 E-mail: hcarleton@cdc.gov Web: http://www.cdc.gov/pulsenet The findings and conclusions in this report are those of the authors and do not necessarily represent the official position of the Centers for Disease Control and Prevention. National Center for Emerging and Zoonotic Infectious Diseases Division of Foodborne, Waterborne, and Environmental Diseases