Introduction to NGS Analysis Tools

National Center for Emerging and Zoonotic Infectious Diseases

Introduction to NGS Analysis Tools
Heather Carleton, PhD, MPH
Team Lead, Enteric Diseases Bioinformatics, Enteric Diseases Laboratory Branch, DFWED, NCEZID, CDC
Next Generation Sequencing: From concept to reality at public health laboratories
June 6th, 2016

Objectives
- Provide a basic overview of terminology surrounding next generation sequencing data
- Discuss analysis terminology
- Highlight NGS analysis tools: freely available command-line tools, online/cloud-based tools, and commercially available analysis tools
- Discuss advantages and disadvantages of the tools

Why do you need analysis tools?
- To translate WGS data
- Consolidation of multiple workflows in the laboratory: identification, serotyping, virulence profiling, antimicrobial resistance characterization, subtyping

Analysis Tools
[Diagram: Sequence QC feeds three analysis routes.]
- Assembly (de novo): whole genome MLST analysis; functional analysis (ANI, serotype, antimicrobial resistance profile, annotation)
- Read mapping / reference-based assembly: hqSNP analysis
- kmer (raw read/assembly): SNP analysis; wgMLST analysis

What is an analysis/bioinformatics pipeline?
[Diagram: QC, de novo assembly, wgMLST, phylogenetic tree]
- A pipeline is the series of tools used to go from raw sequence data to an answer

Types of analysis pipelines
[Diagram: pipeline types arranged by the bioinformatics experience they require]
- Freely available command-line tools
- Online cloud-based / fee-for-service tools
- Commercial software

How to pick an analysis pipeline(s)
- Pick the tool that fits your users: if you do not have bioinformaticians in your lab, then using command-line tools will be a challenge
- Make sure the tool delivers the output you need: if you need a phylogenetic tree, then it needs to do read mapping, SNP detection, and phylogenetic inference
- The tool must provide quality checks of the raw sequence data and of the analysis steps so you can evaluate whether it succeeded

Analysis Tools
[Diagram: Sequence QC, then Assembly (de novo), then whole genome MLST analysis and functional analysis (ANI, serotype, antimicrobial resistance profile, annotation)]

Basic QC analysis
- Tools used to analyze the basic quality of a sequencing run, or of the reads generated per isolate in a sequencing run
- FastQC (also available in BaseSpace)
- Torrent Server
- Geneious
- Qiagen/CLC workbench
- BioNumerics v7

Sequence QC: Q score
[Figure: quality score plot labeled "95% Q30"]
- Quality scores: the likelihood that a base call is correct
- Phred: the part of the fastq file generated by the sequencer that scores base call quality
- Q30: the percentage of base calls that have a 1 in 1,000 chance or less of being incorrect (Q20: 1 incorrect in 100 base calls)
- The score indicates whether a base call is trustworthy and can be used in an hqSNP analysis
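The relationship between a Phred score and its error probability can be sketched in a few lines of Python. This is an illustrative sketch only: it uses the standard Phred definition Q = -10 * log10(P) and assumes the common Phred+33 ASCII encoding of fastq quality strings. The function names and the example quality string are hypothetical and do not come from any of the tools listed above.

```python
def phred_to_error_prob(q):
    """Error probability for a Phred score: P = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)


def decode_quality(quality_string, offset=33):
    """Decode a fastq quality string into Phred scores.

    Assumes Phred+33 encoding; older instruments may use another offset.
    """
    return [ord(char) - offset for char in quality_string]


def percent_at_least(quality_string, threshold=30, offset=33):
    """Percentage of base calls with a Phred score >= threshold (e.g. %Q30)."""
    scores = decode_quality(quality_string, offset)
    if not scores:
        raise ValueError("empty quality string")
    passing = sum(1 for score in scores if score >= threshold)
    return 100.0 * passing / len(scores)


if __name__ == "__main__":
    # Hypothetical 4-base quality string: scores 40, 30, 20, 10.
    example = "I?5+"
    print(decode_quality(example))
    print(percent_at_least(example))
    print(phred_to_error_prob(30), phred_to_error_prob(20))
```

A run-level %Q30 would apply the same calculation across every read in the fastq file rather than a single quality string.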

Sequence QC: Read trimming
- Assess quality over the entire read by looking at quality score by base position and % GC by base position
- Most NGS machines include read trimming in the machine workflow to remove indices and adaptors

Sequence Quality: Insert size
- Insert size refers to the length of the piece of DNA you are sequencing
- Generally you want the insert size to be larger than the total read length of the sequencing chemistry (e.g., for 2x250/500-cycle sequencing, you want an insert size larger than 500 bp)
[Figure: examples of bad and good insert size distributions for 2x150 sequencing]

Sequence QC: Coverage
[Figure: read pileups showing coverage at 40x and coverage at 5x]
- NGS generates 100,000 or more reads per genome sequenced
- Any single location on the genome can be covered by anywhere from zero to hundreds of sequence reads

Sequence Analysis: De novo assembly
- Assemble raw sequence data from ~100k reads into 10-500 contigs
- Assemblers use different algorithms and are built to work with a specific NGS machine
- SPAdes, Velvet, Newbler
- BaseSpace/SPAdes plug-in
- Torrent Server
- Geneious
- Qiagen/CLC workbench
- BioNumerics v7
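To make the coverage figures concrete, the sketch below estimates average coverage as total sequenced bases divided by genome size, and counts per-position depth from read placements. The function names, the read count, the read length, and the genome size are hypothetical values chosen for illustration. They are not taken from the slides or from any particular tool.

```python
def mean_coverage(read_count, read_length, genome_size):
    """Average coverage: total sequenced bases divided by genome size."""
    return read_count * read_length / genome_size


def per_base_depth(read_placements, genome_size):
    """Depth at each genome position from (start, length) read placements.

    Starts are 0-based; reads running past the end are truncated.
    """
    depth = [0] * genome_size
    for start, length in read_placements:
        for position in range(start, min(start + length, genome_size)):
            depth[position] += 1
    return depth


if __name__ == "__main__":
    # Hypothetical run: 480,000 reads of 250 bp on a 3,000,000 bp genome.
    print(mean_coverage(480_000, 250, 3_000_000))
    # Two overlapping 4 bp reads on a toy 8 bp genome.
    print(per_base_depth([(0, 4), (2, 4)], 8))
```

The second function shows why average coverage can hide problems: some positions may have zero depth even when the run-wide average looks adequate.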

Sequence Analysis: De novo assembly
- Combine overlapping reads into a single contig

Sequence Analysis: De novo assembly quality
- Assembly metrics can indicate sequence quality
- Number of contigs the raw reads assemble into. Good: E. coli < 200, Salmonella < 100, Listeria < 30
- N50 statistic: calculated by summing the lengths of the biggest contigs, largest first, until you reach 50% of the total combined contig length; the N50 is the length of the contig that reaches that cutoff. Good: > 200,000 bp
- Example: for a 3 million base pair genome (determined by the sum of contig lengths), the 50% cutoff is 1.5 million base pairs. With largest contigs of 750,000 bp, 500,000 bp, and 350,000 bp, the running total passes 1.5 million base pairs at the third contig, so the N50 is 350,000 bp
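The N50 calculation described above can be sketched as follows. This is a minimal illustration, not the implementation used by any assembler or QC tool. The three largest contigs match the slide example, while the remaining contig lengths are hypothetical values chosen so that the assembly totals 3 million base pairs.

```python
def n50(contig_lengths):
    """Length of the contig at which the running total of contig lengths,
    taken largest first, reaches 50% of the total assembly length."""
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running_total = 0
    for length in lengths:
        running_total += length
        if running_total >= half_total:
            return length
    raise ValueError("no contigs supplied")


if __name__ == "__main__":
    # First three lengths are from the slide example; the rest are
    # hypothetical filler so the total is 3,000,000 bp.
    contigs = [750_000, 500_000, 350_000,
               300_000, 300_000, 250_000, 200_000,
               150_000, 100_000, 100_000]
    print(sum(contigs))  # total assembly length
    print(n50(contigs))  # contig length at the 50% cutoff
```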

Sequence Analysis: Multilocus sequence typing
- A locus can be a gene or part of a gene; any change (single nucleotide polymorphism, insertion, deletion, small inversion) is assigned a new allele number
- Loci can cover the whole genome of an isolate (wgMLST), the core (in common) genes of a species (cgMLST), or the housekeeping genes of a genus (traditional MLST)
[Figure: comparison labeled cgMLST and hqSNP]

Sequence Analysis: MLST
- Compares the number of allele (character) differences between isolates
- Requires an already developed scheme for the analyzed organism
- NCBI Pathogen pipeline (in development)
- BIGSdb (http://pubmlst.org/software/database/bigsdb/)
- Ridom/SeqSphere (http://www.ridom.com/seqsphere/)
- BioNumerics v7
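Counting allele differences between isolates can be sketched as a comparison of allele profiles. The sketch below assumes each isolate is represented as a mapping from locus name to allele number. The locus names, allele numbers, and isolate names are hypothetical, and real schemes also need rules for missing or partial loci, which are simply skipped here.

```python
from itertools import combinations


def allele_differences(profile_a, profile_b):
    """Number of loci, present in both profiles, with different allele numbers."""
    shared_loci = profile_a.keys() & profile_b.keys()
    return sum(1 for locus in shared_loci
               if profile_a[locus] != profile_b[locus])


def pairwise_differences(profiles):
    """Allele differences for every pair of isolates, keyed by name pair."""
    return {
        (name_a, name_b): allele_differences(profiles[name_a], profiles[name_b])
        for name_a, name_b in combinations(sorted(profiles), 2)
    }


if __name__ == "__main__":
    # Hypothetical four-locus scheme and allele calls.
    profiles = {
        "isolate_A": {"locus1": 1, "locus2": 4, "locus3": 2, "locus4": 7},
        "isolate_B": {"locus1": 1, "locus2": 4, "locus3": 3, "locus4": 7},
        "isolate_C": {"locus1": 2, "locus2": 5, "locus3": 3, "locus4": 7},
    }
    for pair, count in pairwise_differences(profiles).items():
        print(pair, count)
```

Because any change within a locus counts as one new allele, a single recombination event and a single point mutation both contribute one difference in this comparison.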

Sequence Analysis: Functional annotation
- Predict isolate characteristics from WGS data (genus/species, serotype, antimicrobial resistance, virulence, etc.)
- NCBI Pathogen pipeline (antimicrobial resistance)
- Center for Genomic Epidemiology (CGE) (virulence, STEC/Salmonella serotype, antimicrobial resistance)
- BioNumerics v7 (genus/species (ANIm), virulence, STEC/Salmonella serotype, antimicrobial resistance)

Identifying genus and species from WGS data
- Can use databases (MLST, ribosomal MLST, 16S) to identify to the genus level and occasionally to the species level
- Can use WGS methods similar to a classic laboratory identification method, DNA-DNA hybridization, to calculate the Average Nucleotide Identity (ANI) between a query genome and a reference genome
[Figure: a query genome (GCATCCCCCGTA) compared with E. coli (ACTAGAGGGAAA) and S. enterica (GCATCCCCCGTT) references; ANI score 98% for S. enterica]

Inferring serotype from WGS
- Since the genes that code for the O and H antigens and determine serotype are known, a database can be built that translates sequence to serotype
- Limitations: sometimes genes are not expressed (non-motile isolates); there may be modifications to the antigen protein that are not encoded in the genes that originally make the protein

Virulence factors from WGS data
- Virulence factors like Shiga toxin or other enterotoxins that are traditionally detected by serology, PCR, or real-time PCR can be detected in WGS data using databases
- Publicly available resources like the Center for Genomic Epidemiology VirulenceFinder can be used to find virulence genes in E. coli, Enterococcus, and S. aureus
- http://www.genomicepidemiology.org/

Predicting antimicrobial resistance from WGS: Acquired resistance
- Usually resistance genes (200-1,000 bp)
- Highly conserved even between different genera (>98% identity)
- Usually located on mobile elements (plasmids, integrons, islands)
- Methods to detect: assembled sequence, resistance databases (ResFinder, ARG-ANNOT, FDA/NCBI AR database)

Acquired resistance: genes associated with a particular AR phenotype

Phenotype | Genotype
Ampicillin, Amoxicillin/clavulanic acid, Cefoxitin, Ceftriaxone, Ceftiofur | blaCMY-2
Kanamycin | aph(3')-Ia
Gentamicin | aac(3)-VIa
Streptomycin | aadA2, strAB
Chloramphenicol | floR
Sulfisoxazole | sul1, sul2
Trimethoprim/sulphamethoxazole | dfrA12, sul1, sul2
Tetracycline | tetA

Predicting antimicrobial resistance from WGS: Mutational resistance
- Usually SNPs, but can be insertions/deletions
- Usually chromosomal
- Genus or species specific
- Methods (no available databases): assembled sequence with in silico PCR; raw reads with SNP analysis

Analysis Tools
[Diagram: Sequence QC, then read mapping / reference-based assembly, then hqSNP analysis and functional analysis (ANI, serotype, antimicrobial resistance profile, annotation)]

Sequence Analysis: Read mapping / hqSNP analysis
- Map raw sequence data to a known reference genome
- Pick the mapper based on sequencing chemistry and organism (diploid/haploid)
- Mapping is used for downstream analysis, including hqSNP
- samtools, bowtie2, smalt (some of these can be wrapped in Galaxy)
- BaseSpace (bacterial, viral, human, and cancer variant apps), Torrent Server
- NCBI Pathogen pipeline
- BioNumerics v7, CLC Genomics Workbench, Geneious

Sequence Analysis: High-quality single nucleotide polymorphisms (hqSNPs)
- What makes a SNP high quality (hq)? A quality filter is applied that filters out nucleotides in the sequence reads based on sequence coverage, quality, and location
[Figure: sequence reads passing through the quality filter to give quality-filtered sequence reads ready for analysis]

What to call a SNP
- SNPs are called based on quality, coverage, and base frequency
- The differences between the reference and the compared genome are extracted and used to determine relatedness
[Figure: pileup of reads (mostly ATGTTCCTC, with a few reading ATGTTTCTC or ATGTTGCTC at the same position) aligned against a reference, captioned "Is it a SNP?"]

Where to call a SNP?
[Figure: raw reads mapped across a region containing mobile elements and genes]
- Mask mobile elements: do not consider SNPs in these locations
- Only call SNPs in genes
- Not all SNP pipelines are equal: where you call SNPs will affect the total SNP count
- SNPs relevant for phylogenetic analysis are vertically transmitted, not horizontally, so horizontally transferred genetic elements like phages can be masked
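The three criteria named above (quality, coverage, base frequency) can be sketched as a filter over the bases that reads place at one reference position. This is a simplified illustration and not the method used by VarScan, Lyve-SET, or any other pipeline in this presentation. The function name, the threshold defaults, and the example pileup are all hypothetical.

```python
from collections import Counter


def call_snp(reference_base, pileup, min_coverage=10,
             min_base_quality=20, min_frequency=0.75):
    """Return the alternate base if the position passes all filters, else None.

    pileup is a list of (base, phred_quality) tuples, one per read
    covering the position. Thresholds are illustrative assumptions.
    """
    # Quality: ignore base calls below the Phred cutoff.
    trusted = [base for base, quality in pileup if quality >= min_base_quality]
    # Coverage: require enough trusted reads at this position.
    if len(trusted) < min_coverage:
        return None
    # Base frequency: the most common base must dominate and differ
    # from the reference.
    base, count = Counter(trusted).most_common(1)[0]
    if base != reference_base and count / len(trusted) >= min_frequency:
        return base
    return None


if __name__ == "__main__":
    # Hypothetical position: reference A, 15 reads (12 C, 1 T, 2 G), all Q35.
    reads = [("C", 35)] * 12 + [("T", 35)] + [("G", 35)] * 2
    print(call_snp("A", reads))
```

Masking mobile elements or restricting calls to genes would be a separate step that decides which positions are passed to a function like this in the first place.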

Where to call a SNP: pick the right reference
- The choice of reference genome affects the analysis: a more closely related reference is more likely to identify true SNP differences

How to interpret hqSNPs: phylogenetic trees
- Use the differences you identified by hqSNP analysis to infer the relatedness, or phylogeny, of isolates
[Figure: example phylogenetic tree relating Isolates A, B, C, and D; branch labels give the number of genetic changes]

NCBI Pathogen Detection Pipeline
[Diagram: data flow from submission to reports]
- NCBI Submission Portal: BioProject, BioSamples, SRA, GenBank
- NCBI Pathogen Pipeline: QC, kmer analysis, genome assembly, genome annotation, genome placement, clustering, SNP analysis, tree construction, reports

Automated Bacterial Assembly
[Diagram: assembly workflow]
- Inputs: SRA reads for the sample; reference distance tree
- Trim reads (Ns, adaptor)
- Find closest reference genome(s)
- De novo assembly panel: SOAPdenovo, MaSuRCA, SPAdes, GS-assembler (Newbler), Celera Assembler
- Argo (reference-assisted assembly)
- ArgoCA (combined assembly)
- Reads remapped to the combined assembly
- Outputs: contig fasta, read placements (bam), quality profile

Results Available Now
- http://www.ncbi.nlm.nih.gov/pathogens/

NCBI Pathogen Detection SNP Pipeline: example 1, stone fruit outbreak

CDC SNP extraction tool: Lyve-SET
- Developed for analysis of raw sequence data from foodborne pathogens
- Works with both Ion Torrent and Illumina data (need to use 2 different mappers)
- Can filter based on quality and clustered SNPs, and can filter out phages automatically
- https://github.com/lskatz/lyve-SET
- Workflow: clean raw reads (cg-pipeline); map reads to reference (SMALT); identify SNPs (VarScan); create phylogeny (RAxML)
- Outputs: SNP matrix, pairwise differences, phylogenetic tree

FDA SNP pipeline: SNP pipeline
- Developed for analysis of sequence data for foodborne pathogens
- Excellent documentation online: http://snp-pipeline.readthedocs.io/en/latest/
- https://github.com/cfsan-Biostatistics/snp-pipeline
- Workflow: map reads to reference (Bowtie2); identify SNPs (VarScan)
- Outputs: SNP matrix, pairwise differences, output for phylogenetic analysis

Analysis Tools
[Diagram: Sequence QC, then kmer (raw read based/assembly), then SNP analysis and wgMLST analysis]

Sequence Analysis: Reference-free raw read and assembly based approaches
- The analysis does not require a reference
- Can use kmer-based analyses to measure relatedness between isolates
- Can also be used to fast-match against a known allele/reference
- kSNP (https://sourceforge.net/projects/ksnp/), MASH
- NCBI Pathogen pipeline (kmer tree)
- Center for Genomic Epidemiology
- CLC Genomics Workbench
- BioNumerics v7.5 wgMLST

Kmer-based analysis
- Computer algorithms use a sliding window to chop up sequence reads into shorter lengths (k) of DNA, called kmers
- Kmers are compared to identify differences
- Example reads (15 bp): Isolate 1 ACTGAACTGACTCAA; Isolate 2 ACTGAACTGACTCAC
- K-mers (10 bp) shown for Isolate 1: ACTGAACTGA, CTGAACTGAC, TGAACTGACT, AACTGACTCA, ACTGACTCAA
- K-mers (10 bp) shown for Isolate 2: ACTGAACTGA, CTGAACTGAC, TGAACTGACT, AACTGACTCA, ACTGACTCAC
- The first four k-mers shown are identical between the isolates; the last is unique to each isolate (ACTGACTCAA versus ACTGACTCAC)

kSNP-based analysis
- Computer algorithms use a sliding window to chop up sequence reads into shorter lengths (k) of DNA; k is always an odd number
- Compare base pair differences at the central position of the kmer
- Example raw reads (15 bp): Isolate 1 ACTGAACTGACTCAA; Isolate 2 ACTGCACTGACTCAA
- K-mers (9 bp) shown for Isolate 1: ACTGAACTG, CTGAACTGA, TGAACTGAC, AACTGACTC, ACTGACTCA
- K-mers (9 bp) shown for Isolate 2: ACTGCACTG, CTGCACTGA, TGCACTGAC, CACTGACTC, ACTGACTCA
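Both ideas on these slides can be sketched with the example reads they use. The first function applies the sliding window; the second finds k-mers unique to each isolate; the third compares central bases of k-mers whose flanking sequence matches. This is a toy illustration of the concept under simplifying assumptions (single reads, no reverse complements, no error handling for sequencing noise) and is not how kSNP or MASH are implemented. All function names are hypothetical.

```python
def kmers(sequence, k):
    """All k-mers from a sliding window of width k, in order."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def unique_kmers(seq_a, seq_b, k):
    """K-mers found only in seq_a and only in seq_b."""
    set_a, set_b = set(kmers(seq_a, k)), set(kmers(seq_b, k))
    return set_a - set_b, set_b - set_a


def central_base_snps(seq_a, seq_b, k):
    """Central-base differences between k-mers with identical flanks.

    Returns (left_flank, base_in_a, base_in_b, right_flank) tuples.
    """
    if k % 2 == 0:
        raise ValueError("k must be odd so the k-mer has a central position")
    middle = k // 2

    def index_by_flanks(sequence):
        table = {}
        for kmer in kmers(sequence, k):
            flanks = (kmer[:middle], kmer[middle + 1:])
            table.setdefault(flanks, set()).add(kmer[middle])
        return table

    table_a, table_b = index_by_flanks(seq_a), index_by_flanks(seq_b)
    snps = []
    for flanks in sorted(table_a.keys() & table_b.keys()):
        if table_a[flanks] != table_b[flanks]:
            snps.append((flanks[0], "".join(sorted(table_a[flanks])),
                         "".join(sorted(table_b[flanks])), flanks[1]))
    return snps


if __name__ == "__main__":
    # Reads from the kmer slide (differ at the last base), k = 10.
    print(unique_kmers("ACTGAACTGACTCAA", "ACTGAACTGACTCAC", 10))
    # Reads from the kSNP slide (differ at the fifth base), k = 9.
    print(central_base_snps("ACTGAACTGACTCAA", "ACTGCACTGACTCAA", 9))
```

Requiring an odd k guarantees that each k-mer has a single central base with equal-length flanks on either side, which is what makes the flank-matching comparison possible.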

Kmer analysis: identifying organisms

End-to-End Analysis Tools
[Diagram: Sequence QC feeds three analysis routes.]
- Assembly (de novo): whole genome MLST analysis; functional analysis (ANI, serotype, antimicrobial resistance profile, annotation)
- Read mapping / reference-based assembly: hqSNP analysis
- kmer (raw read/assembly): SNP analysis; wgMLST analysis

Tools that offer end-to-end solutions: BioNumerics v7.6
- Tools for QC, assembly, wgMLST, hqSNP, and functional prediction, each as a single-button workflow
- Functions as a database, so the metadata needed to interpret the analysis is easily viewable
- For bacteriology, virology, mycology, animals, and plants

Tools that offer end-to-end solutions: CLC Genomics
- Has tools to handle haploid and diploid genomes
- Nice graphics and reporting features
- Can export workflows for others to use

Tools that offer end-to-end solutions: Illumina BaseSpace

Conclusions
- Pick the tool that fits your need
- Think about whether you will be doing CLIA- or CAP-certified tests through the pipeline, and what kind of control and customization you need
- Make sure your laboratorians can use the tool and interpret the output

Questions?

Use of trade names is for identification only and does not imply endorsement by the Centers for Disease Control and Prevention or the U.S. Department of Health and Human Services.

For more information, contact CDC: 1-800-CDC-INFO (232-4636), TTY: 1-888-232-6348, www.cdc.gov

The findings and conclusions in this report are those of the authors and do not necessarily represent the official position of the Centers for Disease Control and Prevention.

Resources

Program | What for? | Where to find it | Cost? | Platform
BioNumerics 7.5 | Assembly, wgMLST, SNP analysis | http://www.appliedmaths.com/ | Yes | Windows
CLC Bio Genomics Workbench | Workflows, read metrics, assemblies, etc., SNP analyses | https://www.qiagenbioinformatics.com/products/clc-genomics-workbench/ | Yes | Windows/Linux
Geneious | Assemblies, trees, SNP analysis | http://geneious.com/ | Yes | Windows
MEGA6 | Phylogenies | megasoftware.net/ | No | Windows
Lasergene | Assemblies, read metrics, analysis | http://www.dnastar.com/ | Yes | Windows
NCBI Genome Workbench | Viewing trees, analysis | http://www.ncbi.nlm.nih.gov/tools/gbench/ | No | Windows/Linux
CFSAN SNP pipeline | Assembly, read metrics, assembly metrics, read cleaning, etc. | sourceforge.net/projects/cgpipeline | No | Linux
Snp Extraction Tool | Read cleaning, creating phylogenies | github.com/lskatz/lyve-set | No | Linux