ESCMID Online Lecture Library. by author


1 ESCMID WS Rapid NGS for Characterization and Typing of Resistant Gram-Negative Bacilli 7-9 October 2015 João André Carriço, Microbiology Institute and Instituto de Medicina Molecular, Faculty of Medicine, University of Lisbon

2 Bacterial chromosome. Single locus variant (SLV): Double locus variant (DLV): Triple locus variant (TLV): Each unique gene sequence (allele) is assigned an integer ID by comparison with online databases. Allelic profile: each allelic profile, aka sequence type (ST), is unequivocally identified by an integer. MLST strength: common nomenclature!
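A minimal sketch of the numbering idea on this slide, assuming a local in-memory table instead of an online database. The toy sequences, the three-locus scheme and the helper names (`assign_id`, `loci_differing`) are hypothetical and only illustrate how alleles and profiles map to integers and how SLV/DLV relationships follow from counting differing loci.

```python
def assign_id(table, key):
    """Return the integer ID for key, giving it the next free ID if it is new."""
    if key not in table:
        table[key] = len(table) + 1
    return table[key]


def loci_differing(profile_a, profile_b):
    """Number of loci at which two allelic profiles differ (1 = SLV, 2 = DLV, 3 = TLV)."""
    return sum(a != b for a, b in zip(profile_a, profile_b))


# Hypothetical toy data: 3 loci instead of the usual 7, very short "alleles".
allele_tables = [{}, {}, {}]  # one sequence -> allele ID table per locus
st_table = {}                 # allelic profile -> ST

isolates = {
    "iso1": ["ACGT", "GGCC", "TTAA"],
    "iso2": ["ACGT", "GGCC", "TTAG"],
    "iso3": ["ACGA", "GGCC", "TTAG"],
}

profiles = {}
sequence_types = {}
for name, sequences in isolates.items():
    profile = tuple(
        assign_id(allele_tables[i], seq) for i, seq in enumerate(sequences)
    )
    profiles[name] = profile
    sequence_types[name] = assign_id(st_table, profile)

for name in isolates:
    print(name, profiles[name], "ST", sequence_types[name])
print("iso1 vs iso2 differ at", loci_differing(profiles["iso1"], profiles["iso2"]), "locus")
```

In a real scheme the IDs come from the curated database, so that every laboratory assigns the same integer to the same sequence.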

3 MLST is based on internal fragments of seven* housekeeping genes (~ bp). Housekeeping gene, MLST locus. Housekeeping genes: present in all the strains. Locus size: based on the PCR product size limitation. *In the great majority of current schemas.

4 WGS SNP approaches: sample, NGS, reads, mapping to reference, VCF/FASTA file with SNPs. Needs a reference strain. Outbreak determination. Comparative studies. Monomorphic (clonal) species. Recombination/horizontal gene transfer is a problem. Difficult to create a nomenclature.

5 WGS gene-by-gene approaches: sample, NGS, reads, assembly, contigs. Central nomenclature server: schemas, allele definitions and identifiers. Output: allelic profile. No need for a reference strain. Buffers the effect of recombination. Simpler to create a nomenclature. Population structure of non-monomorphic species. Multiple schemas can be defined for a single species.

6 Isolate genome* (*chromosome + plasmids + phages). Sequenced reads. Contamination. Other isolates in the sequencing run. Source: Nick Loman

7

8 [Figure: Strains 1 to 6 against loci L0 to L9] Starting from draft genomes (reads assembled into contigs): whole genome MLST

9 [Figure: Strains 1 to 6 against loci in the order L0, L2, L4, L7, L8, L9, L1, L3, L5, L6]

10 [Figure: Strains 1 to 6; loci L1 L2 L4 L7 L8 L9 L2 L3 L5 L6] Core genome -> cgMLST. Accessory genome. Core genome + accessory genome = pan-genome -> wgMLST. ...but what is a locus? How do we define loci and create a schema?
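The core/accessory/pan-genome split can be sketched with plain set operations. The presence table below is hypothetical toy data (three strains, not the six in the figure); it only shows that the core genome is the intersection of the loci sets, the pan-genome is their union, and the accessory genome is the difference.

```python
# Hypothetical presence table: which loci were found in which strain.
presence = {
    "Strain 1": {"L0", "L1", "L2", "L4", "L7", "L8", "L9"},
    "Strain 2": {"L0", "L2", "L3", "L4", "L7", "L8", "L9"},
    "Strain 3": {"L0", "L2", "L4", "L5", "L6", "L7", "L8", "L9"},
}

loci_sets = list(presence.values())

pan_genome = set.union(*loci_sets)        # every locus seen in any strain
core_genome = set.intersection(*loci_sets)  # loci present in all strains
accessory_genome = pan_genome - core_genome  # loci present in only some strains

print("core (cgMLST loci):", sorted(core_genome))
print("accessory:", sorted(accessory_genome))
print("pan (wgMLST loci):", sorted(pan_genome))
```

Note that the core genome defined this way depends on the strains included: adding a strain can only shrink it.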

11 .gbk file, .gbk file: starting with completely assembled genomes and annotated draft genomes. BLAST. Prodigal (Prokaryotic Dynamic Programming Gene finding Algorithm). Core genome: common to all. Accessory genome: present in some.

12 Sequence QA/QC: FastQC. Assembly: SPAdes, Velvet. Annotation: Prokka (oftware.prokka.shtml). Prokaryotic gene prediction: Prodigal.

13 Algorithm to remove possible duplicated loci and determine the pan-genome (): Merge all .ffn files from complete genomes. For each genome, the .ffn file is assumed to contain all possible loci (coding sequences). Remove loci smaller than 200 bp. Remove all identical loci (100% identity and same size). Remove loci contained in larger loci (100% identity and different sizes), keeping the larger locus. BLAST the file against itself. Remove loci whose alignment covers more than 70% of the length of another locus (the two loci are considered the same).
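The first filtering steps can be sketched without BLAST. This is a simplified illustration, not the tool used in the talk: it assumes the merged .ffn content is already loaded into a dict of name to sequence, checks containment on the forward strand only, and leaves out the final self-BLAST step. The function name `filter_loci` and the toy sequences are hypothetical.

```python
def filter_loci(sequences, min_length=200):
    """Apply the size, identity and containment filters to a dict of name -> DNA sequence."""
    # Step 1: remove loci smaller than min_length.
    kept = {name: seq for name, seq in sequences.items() if len(seq) >= min_length}

    # Step 2: remove identical loci, keeping the first one seen.
    seen = set()
    unique = {}
    for name, seq in kept.items():
        if seq not in seen:
            seen.add(seq)
            unique[name] = seq

    # Step 3: remove loci fully contained in a larger locus, keeping the larger one.
    result = {}
    for name, seq in unique.items():
        contained = any(
            len(other) > len(seq) and seq in other
            for other_name, other in unique.items()
            if other_name != name
        )
        if not contained:
            result[name] = seq
    return result


# Toy example with a lowered size cut-off so the sequences stay readable.
toy = {
    "a": "ATGAAATTTGGG",
    "b": "ATGAAATTTGGG",  # identical to a
    "c": "AAATTT",        # contained in a
    "d": "ATG",           # too short
    "e": "ATGCCCCCCTAA",
}
filtered = filter_loci(toy, min_length=5)
print(list(filtered))
```

The remaining step (self-BLAST and the 70% alignment-length rule) needs alignment results and is where the cut-off choices discussed next come in.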

14 Problems with this approach: BLAST settings and identity cut-offs: the order of the input files for the search will create different clusters of orthologous genes (COGs). [Figure: s1, s2, s3 with pairwise identities 60%, 80%, 80%; BLAST HSP] Be aware of paralogous (multiple copy) genes!
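The order effect can be reproduced with a toy greedy clustering. The sketch assumes that the 60% identity in the figure is between s1 and s3 and that the clustering cut-off is 70%; both are assumptions made for illustration, as is the simple "compare with the first member of each cluster" rule.

```python
# Assumed pairwise identities (which pair has 60% is a guess from the figure).
identity = {
    frozenset(("s1", "s2")): 80,
    frozenset(("s2", "s3")): 80,
    frozenset(("s1", "s3")): 60,
}


def greedy_cluster(order, threshold=70):
    """Assign each sequence to the first cluster whose representative is similar enough."""
    clusters = []  # each cluster is a list; its first element is the representative
    for seq in order:
        for cluster in clusters:
            if identity[frozenset((seq, cluster[0]))] >= threshold:
                cluster.append(seq)
                break
        else:
            clusters.append([seq])  # no representative matched: start a new cluster
    return clusters


print(greedy_cluster(["s1", "s2", "s3"]))  # s3 is compared with s1 only (60%)
print(greedy_cluster(["s2", "s1", "s3"]))  # s3 is compared with s2 (80%)
```

With the same three sequences and the same cut-off, the first order yields two loci and the second order yields one.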

15 Problems with this approach: Different size (size difference thresholds). Wrongly annotated (merging of, missing start and end codons). Should we trim the resulting sequences? Software typically used: - OrthoMCL (Li L. Genome Research 2003) - CD-HIT (Fu L et al. Bioinformatics 2012) - BIGSdb (Jolley K and Maiden M. BMC Bioinformatics 2010). [Figure: BLAST HSP] So a wgMLST locus can be a or a part of a

16 Having a set of loci defined and seeded with alleles, each allele can be assigned an allele identifier and a schema can be created

17 (aka allele calling) Two possibilities: 1) De novo assembly and BLASTing the contigs annotated for * against the schema and recovering the allele (which can be a novel allele). 2) Use a mapping approach of reads to alleles (Inouye M. BMC Bioinformatics) (how does it deal with insertions and deletions?). *One alternative: Prodigal (Hyatt D et al. BMC Bioinformatics 2010)

18 Bacterial Isolate Genome Sequence Database (Jolley & Maiden 2010, BMC Bioinformatics 11:). PROs: freely available, open-source, handles thousands of genomes, has MLST schemas implemented for several bacterial species, and some extended MLST and core genome MLST (mainly Neisseria sp. but soon to be expanded). CONs: requires Perl knowledge to install and maintain. Ridom SeqSphere+: commercial software with client-server solutions from assembly to allele calling and visualization for core genome MLST (MLST+/cgMLST). Applied Maths BioNumerics: commercial software with client-server solutions from assembly to allele calling and visualization for whole genome MLST (wgMLST).

19 Core Genome addressing synteny:

20 Core Genome Addressing synteny and paralogy:

21 [Flowchart: allele calling] Gene file: translate gene file to protein, self BLAST, calculate BSR, build gene BLAST database. Genome: run Prodigal on genome, translate to protein, BLAST, calculate BSR. No BLAST match or BSR <= 0.6: LNF. BSR = 1 and same DNA sequence: exact match. BSR > 0.6: LOT? If yes: LOT. If not: inferred allele; add new allele to gene file, calculate BSR of the new allele, re-do gene BLAST database. Output: allelic profile. BSR: BLAST Score Ratio. LOT: Locus On the Tip (of a contig). Prodigal: Prokaryotic Dynamic Programming Gene finding Algorithm.
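The decision logic of the flowchart can be sketched as below. The BSR is taken as the BLAST score of the query against the reference allele divided by the reference allele's score against itself. The order of the checks is one reading of the flowchart, and the function names, arguments and return labels are hypothetical; the scores would come from real BLAST runs.

```python
def blast_score_ratio(query_score, self_score):
    """BSR: score of the query vs. the reference allele, relative to the reference self-score."""
    return query_score / self_score


def classify_call(bsr, same_dna, on_contig_tip, threshold=0.6):
    """Classify one locus in one genome; bsr is None when there was no BLAST match."""
    if bsr is None or bsr <= threshold:
        return "LNF"
    if bsr == 1 and same_dna:
        return "EXACT"
    if on_contig_tip:
        return "LOT"        # match sits at the tip of a contig
    return "INFERRED"       # new allele: would be added to the gene file


# Hypothetical scores for illustration.
print(classify_call(blast_score_ratio(200, 200), same_dna=True, on_contig_tip=False))
print(classify_call(blast_score_ratio(180, 200), same_dna=False, on_contig_tip=False))
print(classify_call(blast_score_ratio(180, 200), same_dna=False, on_contig_tip=True))
print(classify_call(None, same_dna=False, on_contig_tip=False))
```

Because the BSR is computed on translated sequences, two alleles can reach BSR = 1 while differing at the DNA level, which is why the exact-match branch also checks the DNA sequence.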

22 ATGGCAATATTTTTCATGATTTTTCTGATTGTTTGTGTGCTCCTATTGGTGATAGTCACACTGAGTACAGTT TATGTGGTTCGTCAGCAGTCGGTGGCGATTATTGAACGCTTTGGGAAATACCAAAAGGTTGCTAATA GCGGTATTCATATTCGCTTGCCTTTTGGGATTGACTCGATTGCAGCACGGATTCAGTTGCGCTTGTTG CAAAGTGATATTGTGGTTGAGACTAAGACCAAGGACAATGTGTTCGTTATGATGAATGTAGCGACTC AGTACCGTGTCAACGAGCAGAGCGTGACAGATGCTTACTATAAACTCATACGTCCAGAATCTCAGAT TAAATCTTATATCGAAGATGCTCTTCGCTCTTCTGTTCCAAAATTAACCTTGGATGAATTGTTTGAGAA AAAAGATGAGATTGCCCTTGAAGTTCAACACCAAGTAGCAGAAGAAATGACCACTTACGGCTACATT ATCGTGAAAACCTTGATTACCAAGGTCGAACCGGATGCAGAAGTTAAGCAATCCATGAATGAAATCA ATGCGGCGCAACGTAAGCGGGTCGCAGCACAAGAATTGGCGGAAGCTGACAAGATTAAAATTGTCA CTGCAGCTGAAGCCGAAGCAGAAAAAGACCGCCTTCATGGTGTGGGGATTGCCCAACAACGTAAGG CGATTGTGGATGGATTGGCAGAGTCTATCACCGAACTCAAGGAAGCCAATGTTGGCATGACAGAAG AACAAATCATGTCTATCCTCTTGACCAACCAGTATTTGGATACCTTGAATACCTTTGCCTCTAAAGGA AATCAAACCATCTTTTTACCAAATACGCCAAATGGTGTGGATGATATCCGAACACAAATCTTGTCAGC CCTTCGCGCTGAGAAGAAATAA S.pneumoniae protease High number of initiator codons, in different frames (reversed ORF's not being taken in account) Frame 1: Start, Stop Frame 2: Start Frame 3: Start, Stop
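A small sketch of how start and stop codons can be counted in the three forward reading frames, as on this slide (reverse-strand ORFs are ignored here too). It assumes the usual bacterial initiator codons ATG, GTG and TTG and uses a short hypothetical sequence rather than the protease gene above.

```python
STARTS = {"ATG", "GTG", "TTG"}  # assumed bacterial initiator codons
STOPS = {"TAA", "TAG", "TGA"}


def codons_by_frame(seq):
    """Count start and stop codons in forward frames 1, 2 and 3."""
    counts = {}
    for frame in range(3):
        # Split the sequence into complete codons starting at this frame offset.
        codons = [seq[i:i + 3] for i in range(frame, len(seq) - 2, 3)]
        counts[frame + 1] = {
            "starts": sum(codon in STARTS for codon in codons),
            "stops": sum(codon in STOPS for codon in codons),
        }
    return counts


toy_sequence = "ATGGTGTAAATGA"  # hypothetical toy sequence
for frame, tally in codons_by_frame(toy_sequence).items():
    print("Frame", frame, tally)
```

Several candidate starts in the same or different frames is what makes the choice of the annotated start position, and therefore the allele boundaries, ambiguous.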

23 LOT. PLOT: Possible LOT. Size threshold selection: choosing allele length mode +/- 20%. More than one match with BSR > 0.6: Non Informative Paralogous Locus.
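The size threshold can be sketched as a check against the mode of the known allele lengths for a locus. The function name, the toy lengths and the way the 20% tolerance is applied on both sides of the mode are assumptions for illustration.

```python
from statistics import mode


def within_size_threshold(candidate_length, allele_lengths, tolerance=0.2):
    """True if the candidate length lies within mode +/- tolerance of the known allele lengths."""
    modal_length = mode(allele_lengths)
    lower = modal_length * (1 - tolerance)
    upper = modal_length * (1 + tolerance)
    return lower <= candidate_length <= upper


known_lengths = [900, 900, 903, 897]  # hypothetical allele lengths for one locus
print(within_size_threshold(1000, known_lengths))  # inside 720..1080
print(within_size_threshold(700, known_lengths))   # shorter than the lower bound
```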

24 Usually a few genomes account for a large number of the missing genes: bad sequencing run, contamination? -> Remove these genomes from the analysis. Due to similarity creep, the analysis should be rerun every time you finish a batch of genomes to be tested. Assembler choice and parameter choice can lead to different alleles at the same locus.
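One way to spot such genomes is to count the missing calls per genome and flag those above a cut-off. The sketch assumes missing loci are stored as `None` in the allelic profile; the 10% cut-off and the function name are hypothetical and would need tuning per dataset.

```python
def flag_poor_genomes(profiles, max_missing_fraction=0.1):
    """Return the genomes whose fraction of missing loci exceeds the cut-off."""
    flagged = []
    for genome, calls in profiles.items():
        missing = sum(call is None for call in calls)
        if missing / len(calls) > max_missing_fraction:
            flagged.append(genome)
    return flagged


# Hypothetical allelic profiles over 10 loci; None marks a locus that was not called.
toy_profiles = {
    "g1": [1, 1, 2, 1, 3, 1, 1, 2, 1, 1],
    "g2": [1, None, 2, 1, 3, 1, 1, 2, 1, 1],
    "g3": [None, None, None, 1, None, 1, None, 2, None, 1],
}
print(flag_poor_genomes(toy_profiles))
```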

25 Since you have an allelic profile, you can use a goeBURST/Minimum Spanning Tree approach. It counts the number of allelic profile differences and creates the tree using tiebreak rules based on a model of evolution (more on this tomorrow)
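A plain minimum spanning tree over allelic profiles can be sketched as below, using Prim's algorithm on the number of differing loci. This is not goeBURST: ties are broken arbitrarily (by name) rather than by the evolutionary tiebreak rules mentioned on the slide, and the profiles are hypothetical.

```python
def profile_distance(p, q):
    """Number of loci at which two allelic profiles differ."""
    return sum(a != b for a, b in zip(p, q))


def minimum_spanning_tree(profiles):
    """Prim's algorithm; returns a list of (distance, from, to) edges."""
    names = list(profiles)
    in_tree = {names[0]}
    edges = []
    while len(in_tree) < len(names):
        # Cheapest edge that links a node already in the tree to one outside it.
        best = min(
            (profile_distance(profiles[u], profiles[v]), u, v)
            for u in in_tree
            for v in names
            if v not in in_tree
        )
        edges.append(best)
        in_tree.add(best[2])
    return edges


toy_profiles = {
    "A": (1, 1, 1, 1),
    "B": (1, 1, 1, 2),
    "C": (1, 1, 2, 2),
    "D": (3, 3, 3, 3),
}
for distance, source, target in minimum_spanning_tree(toy_profiles):
    print(source, "-", target, ":", distance, "differing loci")
```

In the toy data D is equally distant from A, B and C, so which node it attaches to is decided purely by the tie-breaking rule; this is the situation the goeBURST rules are meant to resolve in a principled way.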

26 wgMLST/cgMLST approaches: Schema: detailing the loci selection methods is important and needed for comparison. Allele calling: again, different approaches are possible and need to be detailed in order to be compared. The results can vary but the general tree structure is the same. How to handle missing data? Nomenclature for type definition is still an open issue due to the variable number of loci, and several competing schemas will appear. Needs a nomenclature server for alleles and types.
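The missing-data question can be made concrete with two possible conventions for the pairwise distance: ignore any locus missing in either profile, or count it as a difference. Both are illustrative options rather than a recommendation from the talk; `None` as the missing marker and the `missing` parameter are assumptions.

```python
def allelic_distance(p, q, missing="ignore"):
    """Count differing loci; loci missing (None) in either profile are ignored or counted."""
    differences = 0
    for a, b in zip(p, q):
        if a is None or b is None:
            if missing == "count":
                differences += 1
            continue
        if a != b:
            differences += 1
    return differences


profile_1 = (1, 2, None, 4)
profile_2 = (1, 3, 5, None)
print(allelic_distance(profile_1, profile_2, missing="ignore"))
print(allelic_distance(profile_1, profile_2, missing="count"))
```

The two conventions give different distances for the same pair of profiles, which is one reason the handling of missing data has to be reported when results are compared.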

27 UMMI Members: Mickael Silva, Sergio Santos, Bruno Gonçalves, Adriana Policarpo, Miguel Machado, Mário Ramirez, José Melo-Cristino. INESC-ID Members: Alexandre Francisco, Cátia Vaz, Pedro Tiago Monteiro. FP7 PathoNGenTrace: Dag Harmsen (Univ. Muenster), Stefan Niemann (Research Center Borstel), Keith Jolley, James Bray and Martin Maiden (Univ. Oxford), Joerg Rothganger (RIDOM), Hannes Pouseele (Applied Maths). Genome Canada IRIDA project (Integrated Rapid Infectious Disease Analysis): Franklin Bristow, Thomas Matthews, Aaron Petkau, Morag Graham and Gary Van Domselaar (NLM, PHAC), Ed Taboada and Peter Kruczkiewicz (Lab Foodborne Zoonoses, PHAC), Fiona Brinkman (SFU), William Hsiao (BCCDC).

28 Registration and Abstract submission open at :