Using Galaxy for the analysis of NGS-derived pathogen genomes in clinical microbiology

Similar documents
Next generation sequencing in diagnostic laboratories: opportunities and challenges

Current status of universal whole genome sequencing of Mycobacterium tuberculosis in the United States

Introduction to CGE tools

VTEC strains typing: from traditional methods to NGS

2014 APHL Next Generation Sequencing (NGS) Survey

The implementation and application of Whole Genome Sequencing in the Campylobacter Reference Laboratory at Public Health England Craig Swift

CDC s Advanced Molecular Detection (AMD) Sequence Data Analysis and Management

Canada's IRIDA platform for genomic epidemiology. Gary Van Domselaar Chief, Bioinformatics National Microbiology Lab Public Health Agency of Canada

Whole Genome Sequence Data Quality Control and Validation

EURL WORKING GROUP ON WHOLE GENOME SEQUENCING AND PULSENET INTERNATIONAL

Practical quality control for whole genome sequencing in clinical microbiology

A year in clinical bioinformatics

The Basics of Understanding Whole Genome Next Generation Sequence Data

GENOTYPING-BY-SEQUENCING USING CUSTOM ION AMPLISEQ TECHNOLOGY AS A TOOL FOR GENOMIC SELECTION IN ATLANTIC SALMON

High-resolution outbreak tracing and resistance detection using whole genome sequencing in the case of a Mycobacterium tuberculosis outbreak

Challenges and opportunities for whole genome sequencing based surveillance of antibiotic resistance

The Basics of Understanding Whole Genome Next Generation Sequence Data

Introduction to Whole Genome Sequencing and its Applications in Microbial Diagnostics

Using New ThiNGS on Small Things. Shane Byrne

Transcriptome Assembly, Functional Annotation (and a few other related thoughts)

Genomics and its Impact on Diagnostic Microbiology

Bioinformatics- Data Analysis

CBC Data Therapy. Metagenomics Discussion

Introduction to NGS Analysis Tools

Introduction to DNA-Sequencing

What Are We Trying to Say Here? Standardizing Next Generation Sequencing Reports for Tuberculosis

Data Intensive Biomedical Research: The EU RL VTEC efforts to take up the NGS challenge. EU RL for E. coli Annual Workshop 2015

Evolution of Next Generation Sequencing Technology: Ready for Patient Management?

Rue Juliette Wytsmanstraat Brussels Belgium T F

Ribotyping Easily Fills in for Whole Genome Sequencing to Characterize Food-borne Pathogens David Sistanich

WGS Analysis and Interpretation in Clinical and Public Health Microbiology Laboratories: What Are the Requirements and How Do Existing Tools Compare?

ngs metagenomics target variation amplicon bioinformatics diagnostics dna trio indel high-throughput gene structural variation ChIP-seq mendelian

Introduction to the MiSeq

HLA and Next Generation Sequencing it s all about the Data

CBC Data Therapy. Metatranscriptomics Discussion

Providing clear solutions to microbiological challenges TM. cgmp/iso CLIA. Polyphasic Microbial Identification & DNA Fingerprinting

Abstract OPEN ACCESS. a11111

Introduction to Bioinformatics

Next-Generation Sequencing Services à la carte

RADSeq Data Analysis. Through STACKS on Galaxy. Yvan Le Bras Anthony Bretaudeau Cyril Monjeaud Gildas Le Corguillé

Introduction to Whole Genome Sequencing and its Applications in Microbial Diagnostics

Bioinformatic analysis of Illumina sequencing data for comparative genomics Part I

Introductie en Toepassingen van Next-Generation Sequencing in de Klinische Virologie. Sander van Boheemen Medical Microbiology

Faramarz Valafar.

Introduction to Bioinformatics

Introduction to PulseNet WGS Tools in BioNumerics v7.6

UAB DNA-Seq Analysis Workshop. John Osborne Research Associate Centers for Clinical and Translational Science

Detecting Clusters and Reporting Results

From Bands to Base Pairs: Implementation of WGS in a PulseNet Laboratory

Microbial sequencing solutions

Whole Genome Sequencing for food safety FSA Chief Scientific Advisor Report and 2013 Listeria pilot study

ACCELERATING GENOMIC ANALYSIS ON THE CLOUD. Enabling the PanCancer Analysis of Whole Genomes (PCAWG) consortia to analyze thousands of genomes

Analytics Behind Genomic Testing

Microbiological diagnosis of Bacillus anthracis. Member of the Bacillus cereus group

Mate-pair library data improves genome assembly

G E N OM I C S S E RV I C ES

Questionnaire on the use of High Throughput Sequencing, Bioinformatics and Computational Genomics (HTS-BCG) in the OIE Reference Centre network

Gene Prediction Group

GenoLogics LIMS Helps EdgeBio Build Services Business

Validating Bionumerics 7.6: A strategic approach from Oregon

Targeted Sequencing in the NBS Laboratory

Name: Ally Bonney. Date: January 29, 2015 February 24, Purpose

Overview of CIDT Challenges and Opportunities

NEXT GENERATION SEQUENCING. Farhat Habib

Data Basics. Josef K Vogt Slides by: Simon Rasmussen Next Generation Sequencing Analysis

by author Bacterial typing - what methodology should I use? MTE Session ECCMID 2017 VIENNA, 25 APRIL 2017 L u í s a V i e i ra P e i xe

Programmatic Implementation of NGS for TB & Future Plans for the ReSeqTB Knowledgebase

Lecture 7. Next-generation sequencing technologies

Whole Genome Sequencing for Enteric Pathogen Surveillance and Outbreak Investigations

Pathogenic organisms no thanks: Use of next generation sequencing techniques in risk assessment and HACCP

Matthew Tinning Australian Genome Research Facility. July 2012

BIOINFORMATICS TO ANALYZE AND COMPARE GENOMES

Studying the Human Genome. Lesson Overview. Lesson Overview Studying the Human Genome

Laboratory Methods: Tuberculosis Diagnosis

Deep Sequencing technologies

Understanding the science and technology of whole genome sequencing

Applications of Next Generation Sequencing in Metagenomics Studies

Introduction to Whole Genome Sequencing and its Applications in Microbial Diagnostics

Potential of human genome sequencing. Paul Pharoah Reader in Cancer Epidemiology University of Cambridge

ESCMID Online Lecture Library. by author

The Genome Analysis Centre. Building Excellence in Genomics and Computa5onal Bioscience

Ecole de Bioinforma(que AVIESAN Roscoff 2014 GALAXY INITIATION. A. Lermine U900 Ins(tut Curie, INSERM, Mines ParisTech

Lesson Overview. Studying the Human Genome. Lesson Overview Studying the Human Genome

E2ES to Accelerate Next-Generation Genome Analysis in Clinical Research

Antimicrobial Resistance Prediction Using Deep Convolutional Neural Networks on Whole Genome Sequencing Data

'Bioinformatics in academia as related to ehealth' - including the "Genomic Virtual Lab"

THE RISE OF WHOLE GENOME SEQUENCING AS A SUBTYPING TOOL FOR MICROBIAL SOURCE TRACKING: FROM FUNDAMENTALS TO APPLICATIONS

Whole-Genome Sequencing (WGS) for Food Safety

Welcome to the NGS webinar series

Whole genome sequencing in the reference laboratory: An Introduction & Overview

Beef Industry Safety Summit Renaissance Austin Hotel 9721 Arboretum Blvd. Austin, TX March 1-3

Incorporating Molecular ID Technology. Accel-NGS 2S MID Indexing Kits

Development and Implementation of a Quality System for Next-Generation Sequencing

Next Generation Sequences & Chloroplast Assembly. 8 June, 2012 Jongsun Park

solid S Y S T E M s e q u e n c i n g See the Difference Discover the Quality Genome

DNA. bioinformatics. genomics. personalized. variation NGS. trio. custom. assembly gene. tumor-normal. de novo. structural variation indel.

De novo assembly in RNA-seq analysis.

Machine Learning and Data Fusion Methods for Phenotype-based Threat Assessment of Unknown Bacteria

Hosted by Paul Webber page 1

Abstract... i. Committee Membership... iii. Foreword... vii. 1 Scope Introduction Standard Precautions References...

Transcription:

Using Galaxy for the analysis of NGS-derived pathogen genomes in clinical microbiology Anthony Underwood*, Paul-Michael Agapow, Michel Doumith and Jonathan Green. Bioinformatics Unit, Health Protection Agency, Colindale

The Health Protection Agency The Health Protection Agency's role is to provide an integrated approach to protecting UK public health The role of the Microbiology Services Division is to provide specialist and reference microbiology to assist with infectious disease surveillance microbial epidemiology co-ordination of the investigation and cause of national and uncommon outbreaks Equivalent to the CDC in the USA

The Health Protection Agency: Activities The specialist and reference microbiology activities are comprised of two primary functions Identification Determining the species of an infectious agent Is the microbe responsible for the disease symptoms described for the patient? Typing Determining the strain of the an infectious agent Does the microbe have the same type as others seen in an outbreak or that seen in environmental or food samples

The changing face of microbiology Public health microbiology was based for a long time on phenotypic testing Selective growth media Colony morphology Gram straining and cell morphology Serotyping Biochemical tests Over the last 2 decades some of the functions have become replaced with molecular tests Identification 16S rrna gene sequencing Other genes for difficult groups such as Bacillus species

The changing face of microbiology 2 Typing microbes has seen the biggest revolution with many molecular tests now commonly used Multi Locus Sequence Typing (MLST) Sequencing of 7 house keeping genes resulting in an allelic profile where a single base change results in a new allele Locus/ ST adk fumc gyrb icd mdh pura reca 10 10 11 4 8 8 8 2 Some bacteria require additional loci to provide sufficient discrimination For example pora and fetb sequencing in Neisseria

The changing face of microbiology 3 Other molecular typing techniques Some organisms are typed using the sequence from a single gene For example sequencing of the emm gene that codes for the M protein can replace the Lancefield serotyping scheme for Group A Streptococci Drug resistance determination e.g mutations in rpob and gyra causing resistance to rifampicin and fluoroquinolones respecutively in Mycobacterium tuberculosis Multi Locus VNTR Analysis (MLVA) The copy number at several repeat loci are concatenated to produce a digital barcode/profile e.g 2-5-4-2-1 These profiles are compared to identify types

Next Generation Sequencing and Microbiology Next Generation sequencing may change the way we do pubic health microbiology The average microbial genome is relatively small By multiplexing samples using molecular tags and the amount of data generated by the Illumina HiSeq machines high coverage paired end data can be generated for 100 ( 115) This will probably fall to approx 40 ( 45) by end of 2011 These prices are close to or cheaper than that required for MLST

Next Generation Sequencing: The future? Looking ahead it is not too crazy to suggest that every pathogen isolated from a patient will have its entire genome sequenced

Next Generation Sequencing and Microbiology 2 There is already the potential to genome sequence an infectious agent and perform typing + The MLST type can be determined But so can the presence/sequence of other genes Virulence gene profiles Resistance genes Point mutations in genes involved in the infectious process Any other gene that at a later time may be of interest great for retrospective studies

Next Generation Sequencing and Microbiology 3 The current Next Generation technologies have limitations for real time results since library prep and sequencing times take days/weeks New technologies such as Ion Torrent or new machines such as the MiSeq promise much faster sequence delivery in under 24 hours For the moment the utility of NGS is confined to medium term projects However it is capturing the imagination of public health microbiologists The problem is in the analysis

Next Generation Sequencing Analysis Over 50 NGS projects underway Very few bioinformaticians attached to projects The burden of analysis falls on a core team of 3 or 4 bioinformaticians

Galaxy and microbial genome analysis Enter Galaxy Assessment of Galaxy led us to believe that it might provide a solution and kill 2 birds with 1 stone Provide a means for laboratory scientists with little/no command line or bioinformatics analysis to analyse NGS data Relieve the burden on bioinformaticians of having to perform processing steps enabling them to concentrate on more complex downstream comparative analyses

Galaxy use within the HPA Warning

Galaxy and microbial genome analysis 2 What kind of simple analyses might clinical microbiologists want to perform? QC assessment of samples before further processing Mapping of reads to a reference SNP calling and filtering of interesting SNPs De novo assembly with QC gateways Assigning MLST type Determine genotype e.g emm type Produce virulence profile

Galaxy MLST determination Scripts already existed within our group that could extract MLST and virulence profiles Scripts written within the group are in a range of languages python, ruby, perl, C++ The ability to use existing Galaxy NGS tools in combination with in house scripts provided the flexibility to deliver bespoke solutions The fact that Galaxy is language agnostic makes it an appealing solution to our polyglot group

Galaxy MLST pipeline Galaxy tools: FASTQ Groomer Trimmer Custom scripts: ABySS assembly MLST profile MLST profile Make a blast database from de novo assembly contigs Extract sequence of 7 loci by blast from contigs Compare each locus sequence with MLST database to discover an exact match (existing allele) or inexact match (new allele) Custom tools/scripts Galaxy tools

Galaxy MLST input

Galaxy MLST input

Galaxy MLST results A paired end data set consisting of 14 million reads took 1 hour to convert, trim, assemble and call the MLST profile. Hands on time 1 minute!

Galaxy Typing by reference genes We have: A set of reads from an unknown (untyped) microbe(s) Already characterised sets of reference (usually virulence) genes Typing scheme(s) based on the presence and absence of given reference genes We want to know: Whether any genes of interest are present Based on presence/absence what types are present

Galaxy Generic genotyping A simple, generic, extensible, updatable approach: Inputs microbial genomes are just Fasta files References, likewise Typing schemes are just a table The script builds a database from the inputs, blast the references against it, and looks up the results in the typing scheme table >aida ATGAATAAGGCCTACAG TATCATATGGAGCCACT CCAGACAGGCCTGGAT TGTGGCCTCAGAGTTA GCCAGAGGACATGGTT TTGTCCTTGCAAAAAAT ACACTGCTGGTATTGGC GGTTGTTTCCACAATC, chua, yja2, TSPE4_C2 B2, +, +, ~ D, +, -, ~ B1, -, ~, + A, -, ~, -

Galaxy Generic genotyping 2

Galaxy Generic genotyping 3 Make a virtue of laziness Use standard, simple types User can select as many input, references and typing tables as needed Use metadata of Fasta headers to usefully label output Output is saved as YAML ---- Datetime: 2011-05-23T16:16:20+01:00 Hits: - Name: unknown-12 - Name: aah et al. Matches: Full: [aah] Partial: [] Phylo_matches: [B2] - Name: aida and iron Matches: Full: [] Partial: [iron, ompt] Phylo_matches: [D1, B2]

Galaxy Galgen Being even more virtuous There s a lot of repetition in Galaxy tool construction Can we save effort in making a new tool? Can we prevent errors by automating tool generation? Yes Label-seqs-by-data.rb in-table epidates.csv uk.fasta To label-seqs-by-date tool dir, template and conf entry

Galaxy Galgen 2 Galgen: Sniffs a command-line and infers tool and executable name, options, input datasets and outputs, etc. Checks these with the user Generates necessary basic tool config and template files Uses hints on command-line (bracket options, file extensions, etc.) Label-seqs-by-data.rb ( in-table epidates.csv) input_uk.fasta Can t guess everything, but aim for all simple cases and provide skeleton for more complex. Coming soon ( a month)

Galaxy Future Direction To process genomes and call SNPs To filter SNPs for those in genes of interest To report SNPs that may result in drug resistance To develop a generic genotyper that can extract the sequence used in genotyping from a draft genome and call the type For longer read (454) data report copy number for repeats that have a short enough repeat length

Galaxy Additional functionality Tasks we need to complete With 50 projects anticipated we need to find an efficient way of storing and organising data using Galaxy datasets To fully integrate the Galaxy instance with our Condor cluster to be able to perform jobs more efficiently in parallel Desirables To process multiple samples with one workflow and organise the final results that makes it easy to link samples to results To organise data sources so scientists can easily select which of 100s of samples to process To organise results so scientists other than those performing the analyses can quickly navigate and view them.

Acknowledgments Galaxy Team Laboratory Scientists from Health Protection Agency ARMRL LHI APU