Computational assembly for prokaryotic sequencing projects


Lee Katz, Ph.D.
Bioinformatician, Enteric Diseases Laboratory Branch (EDLB), Enteric Diseases Bioinformatics Team (EDBiT)
January 30, 2017

Disclaimers
The findings and conclusions in this presentation have not been formally disseminated by the Centers for Disease Control and Prevention and should not be construed to represent any agency determination or policy. The findings and conclusions in this presentation are those of the author and do not necessarily represent the official position of CDC.

April 16, 2008
http://www.compgenomics.biology.gatech.edu/index.php/group_photos

April 23, 2008

2011 to present
ENTERIC DISEASES LABORATORY BRANCH
Vibrio, Campylobacter, Escherichia, Shigella, Yersinia, Salmonella

August 29, 2014
Thanks to John Besser for letting me borrow this slide

PulseNet Detected Outbreaks 2014 (so far)
Thanks to John Besser for letting me borrow this slide

Outline
- Illumina sequencing
- Reads: quality control (Q/C), read metrics, read cleaning
- Assembly: algorithms, assembly metrics
- Genome-based typing
Please also see my talk last year at http://compgenomics2016.biology.gatech.edu/index.php/lectures

Dilbert 2013-05-23

Prokaryotic Sequencing Projects
Stages
- Sequencing
- Assembly
- Feature prediction
- Functional annotation analysis
- Display (Genome Browser)
Examples
- Haemophilus influenzae
- Neisseria meningitidis
- Bordetella bronchiseptica
- Vibrio cholerae
- Listeria monocytogenes
Fleischmann et al. (1995) Whole-Genome Random Sequencing and Assembly of Haemophilus influenzae Rd. Science 269:5223
Kislyuk et al. (2010) A computational genomics pipeline for prokaryotic sequencing projects. Bioinformatics 26:15

Illumina sequencing (2nd Gen)
You have sequences from the Illumina HiSeq
- 250 bp
- Paired end

Paired end Illumina reads
[Figure: three paired-end read layouts, labeled Ideal, Still good, and Bad]

Q/C + cleaning
READS

Q/C
You need to know if your data are good!
Example software that give you read metrics
- FastQC
- FastqQC (from AMOS)
- Computational Genomics Pipeline (CG-Pipeline): run_assembly_readmetrics.pl
- BamUtils stats

Quality Control
[Figure: FastQC output]

Quality Control
[Figure: FastQC output]

The CG-Pipeline way: run_assembly_readmetrics.pl

File | avgreadlength | totalbases | minreadlength | maxreadlength | avgquality
tmp.fastq | 80.00 | 177777760 | 80 | 80 | 35.39
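To make the columns of that report concrete, here is a minimal Python sketch that computes the same kinds of metrics from a four-line-per-record FASTQ stream. It is not the CG-Pipeline implementation: the function name `read_metrics`, the Phred+33 offset, and the choice to average quality per base are assumptions made for illustration.

```python
import io

def read_metrics(fastq_handle, phred_offset=33):
    """Summarize a FASTQ stream that uses exactly four lines per record.

    Assumption: qualities are Phred+33 encoded and avgQuality is the mean
    per-base quality; the real script may define it differently.
    """
    lengths = []
    quality_sum = 0
    for line_number, line in enumerate(fastq_handle):
        line = line.rstrip("\n")
        if line_number % 4 == 1:      # second line of a record: the sequence
            lengths.append(len(line))
        elif line_number % 4 == 3:    # fourth line of a record: the qualities
            quality_sum += sum(ord(ch) - phred_offset for ch in line)
    total_bases = sum(lengths)
    if total_bases == 0:
        raise ValueError("no bases found in input")
    return {
        "avgReadLength": total_bases / len(lengths),
        "totalBases": total_bases,
        "minReadLength": min(lengths),
        "maxReadLength": max(lengths),
        "avgQuality": quality_sum / total_bases,
    }

# Two made-up reads: one of length 4 with quality 40, one of length 6 with quality 2.
example = io.StringIO("@r1\nACGT\n+\nIIII\n@r2\nACGTAC\n+\n######\n")
metrics = read_metrics(example)
print(metrics)
```

For real data, the same function could be given an open file handle instead of the in-memory example.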

READ CLEANING

Why clean my reads?
- Your reads have errors
- These errors can hurt downstream analysis
- Cleaning can be categorized into
  - Trimming
  - Filtering
  - Correcting
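As a rough illustration of the first two categories, the sketch below trims low-quality bases from the 3' end of a read and then filters out reads that are too short or too poor. The thresholds and function names are hypothetical and are not taken from any of the tools in this talk; error correction is a separate, more involved step and is not shown.

```python
def trim_3prime(seq, quals, min_q=20):
    """Trimming: drop bases from the 3' end until one meets min_q."""
    end = len(seq)
    while end > 0 and quals[end - 1] < min_q:
        end -= 1
    return seq[:end], quals[:end]

def keep_read(seq, quals, min_len=3, min_mean_q=20):
    """Filtering: keep a read only if it is long enough and good enough on average."""
    if len(seq) < min_len:
        return False
    return sum(quals) / len(quals) >= min_mean_q

def clean_reads(reads, min_q=20, min_len=3, min_mean_q=20):
    """Trim every (sequence, qualities) pair, then filter the trimmed reads."""
    cleaned = []
    for seq, quals in reads:
        seq, quals = trim_3prime(seq, quals, min_q)
        if keep_read(seq, quals, min_len, min_mean_q):
            cleaned.append((seq, quals))
    return cleaned

# Made-up reads: the first loses its two poor 3' bases, the second is discarded.
reads = [
    ("ACGTAC", [30, 30, 30, 30, 10, 5]),
    ("ACGT", [30, 8, 6, 4]),
]
cleaned = clean_reads(reads)
print(cleaned)
```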

More read cleaning and correction
Software
- CGP documentation: http://cg-pipeline.sourceforge.net/wiki
- Fastx toolkit: http://hannonlab.cshl.edu/fastx_toolkit
- EA-utils: https://code.google.com/p/ea-utils
- AMOS: amos.sourceforge.net
- Sickle: https://github.com/najoshi/sickle
- Quake: http://www.cbcb.umd.edu/software/quake
- BayesHammer: http://bioinf.spbau.ru/spades/bayeshammer
- PrinSeq
- Blue: http://www.bioinformatics.csiro.au/blue/
- and more is out there!
Evaluation
- Fabbro et al. 2013, An extensive evaluation of read trimming effects on Illumina NGS data analysis

ASSEMBLY CONCEPTS

Assembly
- Find overlaps between reads or between nodes in a graph
- Generate contigs (contiguous sequences)
- Generate scaffolds (contigs joined by runs of N, e.g., NNN)

Further reading: Zhang et al. 2011, PLoS ONE; Miller et al. 2011, Genomics

Overlap-layout-consensus
Great at longer, fewer, imperfect reads, e.g., Ion Torrent, 454, PacBio
Stephen Schlebusch, Nicola Illing. Next generation shotgun sequencing and the challenges of de novo genome assembly. South African Journal of Science. Vol. 108

Derive consensus sequence

TAGATTACACAGATTACTGA-TTGATGGCGTAA-CTA
TAGATTACACAGATTACTGACTTGATGGCGTAAACTA
TAG-TTACACAGATTATTGACTTCATGGCGTAA-CTA
TAGATTACACAGATTACTGACTTGATGGCGTAA-CTA
TAGATTACACAGATTACTGACTTGATGGGGTAA-CTA
TAGATTACACAGATTACTGACTTGATGGCGTAA-CTA

Derive each consensus base by weighted voting
Slide adapted from Andrey Kislyuk, http://www.compgenomics2009.biology.gatech.edu/images/1/12/2009-01-14-compgenomics-kislyuk.pdf
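The voting step can be sketched in a few lines of Python using the six aligned reads from this slide. The function name `consensus` and the default of equal weights are assumptions; a real assembler would typically weight each vote, for example by base quality, which the slide does not specify.

```python
from collections import defaultdict

def consensus(aligned_reads, weights=None):
    """Call one base per alignment column by (optionally weighted) voting.

    Assumption: every read has the same aligned length and '-' marks a gap.
    Columns where the gap wins are dropped from the consensus.
    """
    if weights is None:
        weights = [1.0] * len(aligned_reads)   # equal weights = simple majority
    called = []
    for column in zip(*aligned_reads):
        votes = defaultdict(float)
        for base, weight in zip(column, weights):
            votes[base] += weight
        winner = max(votes, key=votes.get)
        if winner != "-":
            called.append(winner)
    return "".join(called)

alignment = [
    "TAGATTACACAGATTACTGA-TTGATGGCGTAA-CTA",
    "TAGATTACACAGATTACTGACTTGATGGCGTAAACTA",
    "TAG-TTACACAGATTATTGACTTCATGGCGTAA-CTA",
    "TAGATTACACAGATTACTGACTTGATGGCGTAA-CTA",
    "TAGATTACACAGATTACTGACTTGATGGGGTAA-CTA",
    "TAGATTACACAGATTACTGACTTGATGGCGTAA-CTA",
]
result = consensus(alignment)
print(result)  # TAGATTACACAGATTACTGACTTGATGGCGTAACTA
```

Each disagreement in the alignment is outvoted by the other reads, and the single-read insertion near the 3' end is dropped because the gap wins that column.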

De Bruijn (kmer)
Great at shorter, numerous, less error-prone reads, e.g., Illumina
http://www.homolog.us/blogs/blog/2012/06/17/an-intuitive-explanation-for-running-de-bruijn-assembler-with-varyingk-mer-sizes/
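A toy version of the kmer approach, as a hedged sketch: reads are cut into kmers, each kmer becomes an edge between its (k-1)-mer prefix and suffix, and a contig is read off by walking nodes that have exactly one successor. The reads, the value of k, and the function names are made up for illustration; real assemblers also handle reverse complements, sequencing errors, and branching, none of which is attempted here.

```python
from collections import defaultdict

def de_bruijn_graph(reads, k):
    """Map each (k-1)-mer prefix to the set of (k-1)-mer suffixes that follow it."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph

def walk_unambiguous(graph, start):
    """Extend a contig from start while the current node has exactly one successor."""
    contig = start
    node = start
    visited = {start}
    while len(graph.get(node, ())) == 1:
        (next_node,) = graph[node]
        if next_node in visited:       # stop rather than loop forever on a cycle
            break
        contig += next_node[-1]        # each step adds one new base
        visited.add(next_node)
        node = next_node
    return contig

# Three overlapping made-up reads from the sequence ATGGCGTGCA.
reads = ["ATGGCG", "GGCGTG", "CGTGCA"]
graph = de_bruijn_graph(reads, k=4)
contig = walk_unambiguous(graph, "ATG")
print(contig)  # ATGGCGTGCA
```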

Another way to look at de Bruijn assembly
Unipath graph of the 1.8-Mb genome of C. jejuni
Unipaths, with the number of times each appears in a path: A (1), B (2), C (3), D (1), E (2), F (1), G (1)
Possible paths: ABCDBCEFCEG, ABCEFCDBCEG
Butler J et al. (2008) ALLPATHS: De novo assembly of whole-genome shotgun microreads. Genome Res. 18:810-820
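The ambiguity on this slide can be reproduced with a small search. The sketch below assumes an edge list inferred from the two listed paths (the figure itself did not survive in this transcript), with each edge carrying the number of times it must be used, and enumerates every walk from A that uses all edges exactly that many times.

```python
from collections import Counter

# Assumption: edges and multiplicities inferred from the two paths on the slide.
EDGES = Counter({
    ("A", "B"): 1, ("B", "C"): 2, ("C", "D"): 1, ("D", "B"): 1,
    ("C", "E"): 2, ("E", "F"): 1, ("F", "C"): 1, ("E", "G"): 1,
})

def enumerate_paths(edges, start):
    """Return all distinct node sequences that use every edge exactly its multiplicity."""
    remaining = Counter(edges)
    total_edges = sum(remaining.values())
    found = set()

    def extend(path):
        if len(path) - 1 == total_edges:          # every edge copy has been used
            found.add("".join(path))
            return
        node = path[-1]
        for (source, target), count in list(remaining.items()):
            if source == node and count > 0:
                remaining[(source, target)] -= 1  # use one copy of this edge
                extend(path + [target])
                remaining[(source, target)] += 1  # backtrack

    extend([start])
    return sorted(found)

paths = enumerate_paths(EDGES, "A")
print(paths)  # ['ABCDBCEFCEG', 'ABCEFCDBCEG']
```

Both orders are consistent with the graph, which is why extra information such as paired-end reads is needed to pick one.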


Recap of assembly
- reads
- Paired end reads
- contigs
- Scaffold (contigs joined by NNNNNNNNNN)

ASSEMBLY EVALUATION

Assembly Metrics
How do you tell if your assembly is good?

Metric | Description | Reference
Assembly Length | The size of the concatenated assembly |
Number of Contigs | The count of contigs |
N50 | The contig size such that half of the assembly is in contigs of size >= N50 and half is in contigs of size <= N50 |
Longest Contig | |
Average contig length | |
Kmer21 | Frequency of kmers with k=21 | http://www.homolog.us/blogs/blog/2012/06/26/what-is-wrong-withn50-how-can-we-make-it-betterpart-ii/
GC-content | Percentage of the genome that is either G or C |
Assembly score | log(N50 × percentGenomeCovered / numContigs) | CG-Pipeline/Lee Katz
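Several of these metrics are simple to compute from a list of contig sequences. The following Python sketch shows assembly length, number of contigs, N50, longest contig, average contig length, and GC-content; the function name and the toy contigs are assumptions, and the assembly score and Kmer21 metrics are left out.

```python
def assembly_metrics(contigs):
    """Compute basic assembly metrics from a list of contig sequences."""
    if not contigs:
        raise ValueError("no contigs given")
    lengths = sorted((len(contig) for contig in contigs), reverse=True)
    total = sum(lengths)

    # N50: walk contigs from longest to shortest until half the assembly is covered.
    running = 0
    n50 = 0
    for length in lengths:
        running += length
        if running * 2 >= total:
            n50 = length
            break

    gc_bases = sum(c.upper().count("G") + c.upper().count("C") for c in contigs)
    return {
        "assemblyLength": total,
        "numContigs": len(lengths),
        "N50": n50,
        "longestContig": lengths[0],
        "avgContigLength": total / len(lengths),
        "percentGC": 100.0 * gc_bases / total,
    }

# Made-up contigs with lengths 6, 4, 3, 2, and 1 (16 bases in total).
contigs = ["GGCCAA", "ACGT", "AAT", "GC", "T"]
metrics = assembly_metrics(contigs)
print(metrics)
```

Here the two longest contigs (6 + 4 = 10 bases) are the first to cover at least half of the 16 bases, so N50 is 4.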


The Far Side
Totally changing the topic

GENOME-BASED TYPING

Can we use WGS with epidemiology?
Phylogeny* approximates** epidemiology***!
* Depending on your evolutionary model
** In a monoclonal outbreak
*** A trace back diagram
Epidemiology: linking cases through a common food vehicle
Phylogeny: how things are genetically related through evolution

Listeria real-time project
- Sequenced every Listeria monocytogenes in the US
- Starting August 2013
- Clinical & environmental (CDC/FDA/USDA)
- Put into large tree (NCBI)
Mile-high view: 7,800 Listeria monocytogenes genomes in this tree
NCBI tree and Genome Workbench: http://www.ncbi.nlm.nih.gov/tools/gbench
ftp://ftp.ncbi.nlm.nih.gov/pathogen/results/listeria/pdg000000001.611/

The flood of data is upon us
- Sept 23, 2016: 91k genomes, 10 taxonomic groups
- Jan 27, 2015: 121k genomes, 19 taxonomic groups
* for foodborne and hospital-acquired pathogens

Are two isolates in the same outbreak?
Need to refine NCBI analysis
Can use many methods
- MinHash clustering with kmers
- wgMLST
- SNPs
- Assembly-based (faster)
- Raw-read based (resolution)

MinHash
Kmer: a length of DNA k nucleotides long
1. Shred all reads into kmers
2. How many kmers are in common?
3. Transform into a Jaccard distance
Gives a genomic distance
With the distances, make tree
Fast, not phylogenetic
Analogy: comparing equal-sized shreds of paper from two books
Image credits: DEATH OF A SHREDDER https://digginginthedriftless.com/2011/01/04/death-of-a-shredder
NCBI: http://www.ncbi.nlm.nih.gov/pathogens
Mashtree: https://github.com/lskatz/mashtree
Mash: http://mash.readthedocs.io/en/latest
Sourmash: https://github.com/dib-lab/sourmash
MinHash algorithm explained: Ondov et al. (2016) Mash: fast genome and metagenome distance estimation using MinHash. Genome Biology
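The three numbered steps can be sketched as follows. This is a simplified bottom-k MinHash estimate of the Jaccard distance, not the code used by Mash, Mashtree, or Sourmash: the hash function, the sketch size, the value of k, and the function names are all assumptions, and canonical (strand-independent) kmers are ignored.

```python
import hashlib

def kmers(seq, k):
    """Step 1: shred a sequence into its set of distinct kmers."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def hash_kmer(kmer):
    """Deterministic 64-bit hash of a kmer (SHA-1 is an arbitrary choice here)."""
    return int.from_bytes(hashlib.sha1(kmer.encode()).digest()[:8], "big")

def sketch(seq, k=5, size=50):
    """Keep only the `size` smallest kmer hashes as a compact signature."""
    return sorted(hash_kmer(kmer) for kmer in kmers(seq, k))[:size]

def jaccard_distance(sketch_a, sketch_b, size=50):
    """Steps 2 and 3: estimate shared kmers, then convert to a distance."""
    merged = sorted(set(sketch_a) | set(sketch_b))[:size]
    if not merged:
        raise ValueError("both sketches are empty")
    in_both = set(sketch_a) & set(sketch_b)
    shared = sum(1 for value in merged if value in in_both)
    return 1.0 - shared / len(merged)

# Made-up sequences: identical inputs give 0.0, unrelated inputs give 1.0.
genome_a = "ATGGCGTGCAATGCCGTAGGCTAACGT"
same_distance = jaccard_distance(sketch(genome_a), sketch(genome_a))
far_distance = jaccard_distance(sketch("AAAAAAAA"), sketch("CCCCCCCC"))
print(same_distance, far_distance)
```

A matrix of such pairwise distances is what a distance-based tree builder would take as input, which is why the result is fast but not a phylogeny.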

MLST
MLST: multilocus sequence typing
Locus: a place in a genome. Plural: loci
- Identify a set of loci in the genome
- Compare each locus in a genome against the set of loci
- Count differences and the number of loci compared
Analogy: comparing two books, page by page
Different kinds
- 7-gene MLST
- rMLST
- wgMLST (whole genome MLST)
- cgMLST (core genome MLST)
- and more
Software and schemes
- StringMLST: https://github.com/anujg1991/stringmlst
- SRST2: https://github.com/katholt/srst2
- BioNumerics: http://www.appliedmaths.com/applications/mlst
- mlst: https://github.com/tseemann/mlst
- Salmonella 7-gene scheme: http://mlst.warwick.ac.uk/mlst/dbs/senterica
References
- Moura, Alexandra, et al. "Whole genome-based population biology and epidemiological surveillance of Listeria monocytogenes." Nature Microbiology 2 (2016): 16185.
- Jackson, Brendan R., et al. "Implementation of nationwide real-time whole-genome sequencing to enhance listeriosis outbreak detection and investigation." Clinical Infectious Diseases 63.3 (2016): 380-386.
- Maiden, Martin CJ, et al. "Multilocus sequence typing: a portable approach to the identification of clones within populations of pathogenic microorganisms." Proceedings of the National Academy of Sciences 95.6 (1998): 3140-3145.
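Once each genome has been reduced to an allele number per locus, the comparison step is a simple count. The sketch below assumes allele profiles are stored as dictionaries and that a missing call is recorded as None; the locus names are the seven loci of the Salmonella scheme, but the allele numbers are made up for illustration.

```python
def compare_profiles(profile_a, profile_b):
    """Return (allele differences, loci compared) for two allele profiles.

    Loci that are absent or uncalled (None) in either profile are skipped,
    so they count neither as a difference nor as a comparison.
    """
    compared = [
        locus for locus in profile_a
        if profile_a[locus] is not None and profile_b.get(locus) is not None
    ]
    differences = sum(1 for locus in compared if profile_a[locus] != profile_b[locus])
    return differences, len(compared)

# Hypothetical allele calls for two isolates.
isolate_1 = {"aroC": 10, "dnaN": 7, "hemD": 12, "hisD": 9, "purE": 5, "sucA": 9, "thrA": 2}
isolate_2 = {"aroC": 10, "dnaN": 7, "hemD": 12, "hisD": 9, "purE": 5, "sucA": 14, "thrA": None}

differences, loci_compared = compare_profiles(isolate_1, isolate_2)
print(differences, loci_compared)  # 1 6
```

The same counting scales from 7 loci to the thousands of loci used in wgMLST; only the size of the dictionaries changes.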

SNPs
SNP: Single nucleotide polymorphism
hqSNP: high-quality SNP
Compare individual letters in a query genome against the reference genome
Analogy: comparing individual letters between two books
Software
- Lyve-SET: github.com/lskatz/lyve-set
- SNP-Pipeline (CFSAN): http://snp-pipeline.readthedocs.org
- kSNP: https://sourceforge.net/projects/ksnp
- SPANDx: https://github.com/dsarov/spandx
- Snippy: https://github.com/tseemann/snippy
- NASP: http://tgennorth.github.io/nasp
- Red Dog: https://github.com/katholt/reddog
- Harvest: https://harvest.readthedocs.io
- MUMmer: http://mummer.sourceforge.net/manual
- SNVPhyl: http://snvphyl.readthedocs.io
- RealPhy: http://realphy.unibas.ch
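The letter-by-letter comparison can be illustrated with two sequences that are already aligned to the same coordinates. This sketch is far simpler than any of the pipelines listed: it assumes equal-length aligned sequences, skips any position where either base is not A, C, G, or T, and uses made-up data.

```python
def snp_positions(reference, query):
    """List (1-based position, reference base, query base) for each SNP.

    Assumption: the two sequences are pre-aligned and equal in length.
    Ambiguous or masked positions (anything other than A, C, G, T) are ignored.
    """
    if len(reference) != len(query):
        raise ValueError("sequences must be aligned to the same length")
    unambiguous = set("ACGT")
    snps = []
    for index, (ref_base, query_base) in enumerate(zip(reference.upper(), query.upper())):
        if ref_base in unambiguous and query_base in unambiguous and ref_base != query_base:
            snps.append((index + 1, ref_base, query_base))
    return snps

reference = "ACGTACGTAC"
query = "ACGTTCGNAC"   # one real difference at position 5, one masked base at position 8
snps = snp_positions(reference, query)
print(snps)  # [(5, 'A', 'T')]
```

The length of the returned list is the SNP distance between the query and the reference.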

Which algorithm should you use?
[Table: MinHash, wgMLST, and hqSNP (columns) rated on the criteria below (rows); the cell values did not survive in this transcript]
- Diversity
- Outbreak-level resolution
- Further genomic information
- Minimal upfront effort
- Fast
- Easy-to-use for anyone
- Easy-to-use for bioinformaticians

SOFTWARE BY LEE
https://github.com/lskatz

Lyve-SET
Lyve: Listeria, Yersinia, Vibrio, and Enterobacteriaceae reference lab
SET: SNP Extraction Tool
Some details on Lyve-SET:
- For Linux
- Extensive documentation
- Help options are embedded in each script
- --fast option, takes ¼ the normal time
- Easy to use
- Modular
- Most stable version: 1.1.4f
- Newest version: 2.0
https://github.com/lskatz/lyve-set

An animation of Lyve-SET
0. Pre-processing
   a) Phage discovery/masking
   b) Manual identification of troublesome regions
   c) Read cleaning (Poster Wagner et al)
1. Mapping - SMALT
   a) 95% read identity
   b) Unambiguous mapping
2. SNP calling - VarScan
   a) 75% consensus
   b) 10x depth
3. Phylogeny inferring - RAxML v8
   a) Removal of clustered SNPs
   b) Ascertainment bias model
   c) Maximum likelihood
[Figure: reference genome with phage and manually identified regions masked; SNP profiles for Genomes 1 to 4; resulting phylogeny with bootstrap support values]
https://github.com/lskatz/lyve-set
https://www.sanger.ac.uk/resources/software/smalt
Koboldt, D., Zhang, Q., Larson, D., Shen, D., McLellan, M., Lin, L., Miller, C., Mardis, E., Ding, L., & Wilson, R. (2012). VarScan 2: Somatic mutation and copy number alteration discovery in cancer by exome sequencing. Genome Research. DOI: 10.1101/gr.129684.111
Stamatakis, A. (2014) RAxML version 8: a tool for phylogenetic analysis and post-analysis of large phylogenies. Bioinformatics. doi:10.1093/bioinformatics/btu033
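To show what the two SNP-calling thresholds mean in practice, here is a hedged sketch of a per-site filter that applies a 10x depth and 75% consensus rule to the bases piled up at one reference position. It is not VarScan's or Lyve-SET's code; the function name and the choice to report a failed site as "N" are assumptions.

```python
from collections import Counter

def call_site(pileup_bases, min_depth=10, min_consensus=0.75):
    """Call the base at one position, or 'N' if the site fails either threshold.

    pileup_bases is a string with one character per read covering the site.
    """
    depth = len(pileup_bases)
    if depth < min_depth:                      # not enough reads cover the site
        return "N"
    base, count = Counter(pileup_bases).most_common(1)[0]
    if count / depth < min_consensus:          # reads disagree too much
        return "N"
    return base

# Made-up pileups for four sites.
calls = [
    call_site("AAAAAAAAAT"),   # depth 10, 90% A: passes
    call_site("AAAAA"),        # depth 5: fails the depth threshold
    call_site("AAAAAACCCC"),   # depth 10, 60% A: fails the consensus threshold
    call_site("GGGGGGGGGGGG"), # depth 12, 100% G: passes
]
print(calls)  # ['A', 'N', 'N', 'G']
```

Sites that come back as "N" in any genome would be treated as unknown rather than as evidence for or against a SNP.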

Lyve-SET method with more details
Katz et al 2017, Frontiers in Microbiology. In review.

Comparison with other SNP pipelines
Each data point is a SNP distance as determined by Lyve-SET (x-axis) and the distance of an alternative SNP pipeline (y-axis). The slope indicates the number of SNPs per Lyve-SET SNP.
Katz et al 2017, Frontiers in Microbiology. In review.

Comparison with whole-genome MLST (Listeria monocytogenes only)
Katz et al 2017, Frontiers in Microbiology. In review.

Comparison with other SNP Pipelines and wgMLST

Metric | Lyve-SET | kSNP | RealPhy | Snp-Pipeline | SNVPhyl | wgMLST
Tree sensitivity (Sn) [a] | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 100.0%
Tree specificity (Sp) [a] | 100.0% | 90.2% | 100.0% | 100.0% | 100.0% | 100.0%
Average of Sn and Sp | 100.0% | 95.1% | 100.0% | 100.0% | 100.0% | 100.0%
Kendall-Colijn (λ = 0) [b] | - | 1.26E-02 | 7.51E-03 | 9.28E-03 | 9.15E-02 | 1.00E-04
Robinson-Foulds [b] | - | 3.16E-69 | 6.79E-40 | 5.39E-74 | 9.61E-49 | 1.55E-147
Mantel | - | 0.60 | 0.77 | 0.77 | 0.79 | 0.74
SNP ratio [c,d] | - | 0.53, 0.78 | 0.97, 0.84 | 1.61, 1.75 | 0.67, 0.84 | 0.69, 0.72
Goodness-of-fit (R²) [d] | - | 0.46, 0.42 | 0.7, 0.75 | 0.77, 0.3 | 0.83, 0.68 | 0.75, 0.72
Genome analyzed [e] | 25.9% | 0.1% | 84.8% | 0.3% | 82.1% | 88.2%

a) Average percentage from 11 outbreaks. The S. enterica outbreak 1203NYJAP-1 was removed from the Sn and Sp calculations as an outlier because all pipelines except wgMLST produced errors with grouping outbreak vs non-outbreak isolates.
b) Geometric mean.
c) Number of SNPs per Lyve-SET SNP, averaged across 12 outbreaks. For wgMLST, this is the number of alleles per Lyve-SET SNP.
d) The average for 12 outbreaks. First value is for all data points; second value is for distances between only outbreak-associated genomes.
e) The average for 12 outbreaks. Percentage of the reference genome included for analysis. For wgMLST, the average percentage was calculated by obtaining each GenBank-formatted file with annotated wgMLST loci and calculating the breadth of coverage for all loci.

Katz et al 2017, Frontiers in Microbiology. In review.
Comparison metrics:
- Kendall and Colijn 2015, A tree metric using structure and length to capture distinct phylogenetic signals. arXiv
- Mantel 1967, The detection of disease clustering and a generalized regression approach. Cancer Res
- Robinson and Foulds 1981, Comparison of phylogenetic trees. Mathematical Biosciences

Lyve-SET
Lyve: Listeria, Yersinia, Vibrio, and Enterobacteriaceae reference lab
SET: SNP Extraction Tool
Who uses Lyve-SET?
Used in labs at CDC:
- Foodborne Diseases Laboratory Branch: EDBiT, Reference Lab (NERO), PulseNet, NARMS, Botulism Lab
- Clinical and Environmental Microbiology Branch
- Special Bacterial Pathogens Branch
- Respiratory Diseases Branch
- Meningitis and Vaccine Preventable Diseases Branch
State health labs: Colorado, Washington
USDA: FSIS, ARS
https://github.com/lskatz/lyve-set

Mashtree
- Useful for creating a dendrogram quickly
- Does not create a phylogeny
https://github.com/lskatz/mashtree

Mashtree comparison
[Figure: trees of the same set of isolates from eight analyses: RealPhy v112, Snp-Pipeline v0.5.2, SNVPhyl v1.0, wgMLST (BioNumerics v7.5), kSNP v3.0.0, Lyve-SET v1.1.4f, Mashtree v0.08 (5 minutes, 41 seconds), and Mashtree v0.08 with min_abundance_filter (35 minutes). Tip labels are isolate identifiers with the prefixes VA-WGS, CFSAN, PNUSAL, FS, and NYAG.]
Green indicates outbreak association
Removed one outlier genome in the min_abundance_filter tree that flattened the entire tree

Can't we just estimate a trace back tree?
- TransPhylo (Imperial College/bcCDC): http://biorxiv.org/content/biorxiv/early/2016/07/22/065334.full.pdf
- Outbreaker (Imperial College): http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1003457
- SCOTTI (University of Oxford): http://journals.plos.org/ploscompbiol/article?id=10.1371%2fjournal.pcbi.1005130

Acknowledgements
- Every single compgenomics class since 2008
- My branch at CDC

Questions?
Lee Katz
lkatz@cdc.gov
@lskatz
github.com/lskatz