Nature Genetics: doi: /ng Supplementary Figure 1. Flow chart of the study.


Supplementary Figure 1. Flow chart of the study. The strategy involving the four different approaches (epidemiological, clinical, genomic and experimental) used in this study is shown. C, clinical; F, food; A, animal infection; E, environmental.

Supplementary Figure 2. MLST diversity of lineages I and II represented by the genomes analyzed in the present study. MLST profiles of the genomes used in this study (orange, newly sequenced genomes; dark blue, public genomes) were compared to published MLST data (refs. 15–17, 24, 35) using the minimum spanning tree algorithm with the software tool BioNumerics v6.6 (Applied Maths). Each circle corresponds to a sequence type (ST). Gray zones surround STs that belong to the same clonal complex (CC). CC numbers are given next to the corresponding zones. The lines between STs are bold, plain, discontinuous and light discontinuous depending on the number of allelic mismatches between profiles (1, 2, 3, and 4 or more, respectively); note that links are only indicative, as alternative links with equal weight might exist. There were no common alleles between the two major lineages.

Supplementary Figure 3. Core and pan-genome size as a function of genome numbers. The number of genes in common (core genome; in green) and the total number of homologous gene families (pan-genome; in blue) are shown for increasing numbers of genomes (in-house R script). Numbers of gene families were estimated using 1,000 random permutations of the genome input order. Solid lines correspond to the average number of gene families over all permutations. Dashed lines indicate the standard deviation of the mean. The upper and lower edges of the blue and green areas correspond to the maximum and minimum numbers of gene families, respectively. For 104 sequenced genomes, the pan-genome and core genome comprised 6,867 and 1,791 genes, respectively. The core genome represented 60% of the average number of genes per genome and 26% of the pan-genome.
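The permutation procedure described in this caption can be sketched as follows. This is a minimal Python illustration, not the in-house R script used in the study; the function name `accumulation_curves`, the input format (one set of gene-family identifiers per genome) and the fixed random seed are assumptions made for the example.

```python
import random
from statistics import mean, pstdev


def accumulation_curves(genomes, n_permutations=1000, seed=0):
    """Summarize core and pan-genome sizes over random genome orders.

    genomes: list of sets, each holding the gene-family identifiers
             found in one genome (hypothetical input format).
    Returns one summary dict per number of genomes considered.
    """
    rng = random.Random(seed)
    n = len(genomes)
    core_sizes = [[] for _ in range(n)]
    pan_sizes = [[] for _ in range(n)]

    for _ in range(n_permutations):
        order = list(genomes)
        rng.shuffle(order)  # one random input order of the genomes
        pan, core = set(), None
        for k, families in enumerate(order):
            pan |= families  # families seen in at least one genome so far
            core = set(families) if core is None else core & families
            pan_sizes[k].append(len(pan))
            core_sizes[k].append(len(core))

    summary = []
    for k in range(n):
        summary.append({
            "n_genomes": k + 1,
            "core_mean": mean(core_sizes[k]),
            "core_sd": pstdev(core_sizes[k]),
            "core_min": min(core_sizes[k]),
            "core_max": max(core_sizes[k]),
            "pan_mean": mean(pan_sizes[k]),
            "pan_sd": pstdev(pan_sizes[k]),
            "pan_min": min(pan_sizes[k]),
            "pan_max": max(pan_sizes[k]),
        })
    return summary


# Toy example with three genomes (identifiers are illustrative only).
toy = [{"a", "b", "c"}, {"a", "b", "d"}, {"a", "e"}]
curves = accumulation_curves(toy, n_permutations=200)
```

The last entry of the returned list does not depend on the input order, because the intersection and union over all genomes are fixed; only the intermediate points vary between permutations.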

Supplementary Figure 4. Phylogenetic distribution and size variation of gene products encoded in the LIPI-1, LIPI-3 and SSI-1 genomic islands. The columns show the presence/absence and truncation/deletion status of virulence gene products. The maximum-likelihood phylogeny was obtained on the basis of the core genome of 104 L. monocytogenes isolates (Supplementary Note). Genes located in the same syntenic blocks are indicated with black lines above the corresponding genes. Gene products encoded by LIPI-3 are named with the terminal part of the locus tags (LMOf2365_1113 to LMOf2365_1119) in the F2365 genome. InlA truncation was the main feature correlated with hypovirulent clones. Strain LM13656 (CC2) was initially selected because it was non-hemolytic. Consistently, it showed a truncation in the hly gene, which encodes listeriolysin O (LLO), the virulence factor involved in hemolysis.

Supplementary Figure 5. Distribution and variability of virulence gene products. All 69 variable reported virulence genes (ref. 28) are shown except those already shown in Supplementary Figure 4. The colored rounded squares show the distribution and size variations of virulence gene products. The numbers above the figure correspond to genes listed in Supplementary Table 7. Genes present in all the genomes with invariable size are not shown. Genes located in the same syntenic blocks are indicated with black solid lines above the corresponding genes.

Supplementary Figure 6. Description of the CC4-associated PTS cluster LIPI-4. (a) The gene content of the PTS cluster and flanking core genes in the CC4 strain LM (below) is shown in comparison to the genome of CC1 strain LL195 (above). Putative functions are indicated. Identity percentages between sequences were determined by nucleotide BLAST and are represented using Easyfig 2.1. (b) The genomic region of the PTS in the CC4 strain LM in which the PTS cluster was deleted (CC4ΔPTS) is shown in comparison to the isogenic wild-type strain. Orange, genes of the PTS cluster; blue, flanking core genes.


Supplementary Figure 7. Implication of LIPI-4, the hypervirulent clone CC4-associated PTS, in CNS and placental infection. The data shown here are supplementary to Figure 5. (a,b) Humanized mice were inoculated orally at a dose of CFUs (a) or intravenously at a dose of CFUs (b) with reference strain EGDe (n = 5 in a), representative CC4 strain LM (CC4; n = 9 in a and n = 6 in b) or a whole-PTS-cluster deletion mutant derived from LM (CC4ΔPTS; n = 7 in a and n = 6 in b). (c) Humanized mice were intravenously infected at a dose of CFUs with CC4ΔPTS containing either a single copy of pimc (n = 9) or pimc with the PTS cluster under its native promoter on the chromosome (n = 9). (d) The competition index of WT EGDe (n = 4) or WT CC4 (n = 4) was tested against chloramphenicol-resistant EGDe (EGDe containing pimc) in pregnant humanized mice. (e) The competition index of WT CC4 was tested against chloramphenicol-resistant CC4ΔPTS (pimc) (n = 3) or CC4ΔPTS (pimc-pts) (n = 4) in pregnant humanized mice. Pregnant humanized mice at day 14/21 of gestation were intravenously infected with a 1:1 mixture of the two strains as indicated at a total dose of CFUs. Mice were sacrificed on day 5 after infection when bacteria were orally inoculated (a) or on day 2 after infection when intravenously infected (b–e). Results are shown as medians with interquartile range. Each dot represents the bacterial count from a whole organ. Statistical analyses were carried out by Dunn's multiple-comparison test (a), Mann-Whitney U test (b,c) or Wilcoxon matched-pairs signed-rank test (d,e): *P < 0.05, **P < 0.01, ***P <

Supplementary Note

Pulsed-field gel electrophoresis (PFGE)

Since 2005, the NRC has used fully standardized processes to record information about the isolates. The NRC and NRL laboratories use a standardized PFGE typing protocol, which enables comparison of PFGE patterns from both centers. All strains (the 6,842 isolates to be identified at the clone level and the 796 isolates of the PFGE-MLST reference library) were routinely typed by PFGE at the NRC or NRL according to the PulseNet standardized procedures with the AscI and ApaI restriction enzymes (ref. 1). Data analysis was performed using BioNumerics v6.5 (Applied Maths). PFGE profiles were compared using the complete linkage clustering algorithm based on the number of different bands. Band comparison settings were 1.5% for the overall pattern matching optimization parameter and 1% for band position tolerance.

PCR serogrouping

PCR serogrouping was performed for all strains using the PCR method of Doumith and colleagues (2004) (ref. 2), which classifies strains into four main serogroups: IIa, IIc, IIb and IVb.

High confidence identification of MLST clones based on PFGE profiles

MLST clones of the 7,342 isolates were identified by comparing their PFGE profiles (ApaI and AscI restriction enzymes) against a PFGE-MLST reference library of 796 isolates typed by both PFGE and MLST, which was used to establish correspondences between the two typing methods. The 796 strains were selected out of 4,045 isolates previously typed by PFGE using the enzymes ApaI and AscI, separately. A UPGMA dendrogram based on the composite dataset made of the ApaI and AscI patterns was constructed, and isolates were selected to cover the diversity of branches in order to obtain a collection of strains that reflects the diversity of the PFGE types. When several isolates represented the same PFGE type or branch, isolates previously analyzed by MLST and/or representing the most frequent PFGE pattern within a branch were given priority. The final collection represented the worldwide diversity of L. monocytogenes, as it contained isolates from 34 different countries on five continents, including 104 isolates from non-French European countries, 31 from America, 15 from Africa, 6 from Oceania and 5 from Asia; all remaining isolates were collected and analyzed in the context of listeriosis surveillance in France. The reference library isolates were from clinical sources (n = 469, 58.9%), food (n = 183, 23.0%), animals (n = 52, 6.5%), the environment (n = 34, 4.3%) or unknown sources (n = 58, 7.3%). In this study, 336 isolates were subjected to MLST characterization, whereas 460 strains had been previously analyzed by MLST (refs. 3-6). The reference library represented a total of 33 singletons and 41 clonal complexes.

An identification library was constructed using BioNumerics v6.5. Library units were defined for each MLST clone and were populated with all isolates belonging to the corresponding clone. Because the PFGE patterns of CC8 and CC16 were very similar, these two clones were merged into a single unit, CC8-16. The same was true for CC101 and CC90 (CC101-90). All other MLST clonal complexes and singletons were clearly distinguished by PFGE and therefore represented single library units. The PFGE patterns of the 6,842 isolates to be identified (7,342 minus the 500 reference isolates already typed by MLST) were then compared one by one to all isolates of the reference library, and a similarity score was calculated.
This score was calculated for each enzyme separately by subtracting from 100 a distance equal to the total number of bands that differed between the two compared patterns. The final similarity score is the average of the scores for the two enzymes.

To minimize false identification, the following algorithm was then used. First, the query isolate was tentatively assigned to the clone of the library unit containing the PFGE profile with the best similarity score. Second, if the similarity score was lower than 97.5%, the isolate was excluded as too distant. Third, because in some cases the query pattern was equidistant (or nearly so) to the best matches of two or more library units, we recorded the similarity score to the second best-matching clone and validated the identification only if this second-best similarity score was at least 1% lower than the score of the best-scoring clone. In all other cases, the isolate was excluded for lack of confident discrimination between its two best candidate clones.

The above thresholds, called SST for the similarity score threshold and DST for the threshold of difference in similarity score between the two best-scoring clones, were selected to optimize the trade-off between sensitivity and specificity (Figure below). To do so, each isolate of the reference library (for which the clone was determined by MLST) was queried against all the other isolates based on its PFGE pattern, and different values of SST (varying from 95% to 99%) and of DST (0.5% and 1%) were tested in conjunction to assign isolates to clones. The resulting percentages of misidentification and of non-identification were recorded (Figure below). Based on these simulations, SST was fixed at 97.5% and DST at 1%, as this combination was the best compromise between misidentification and non-identification. Using those thresholds, the percentage of misidentification was 0.38% and the percentage of non-identification was 5.89%.
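The scoring and decision rule described above can be sketched in a few lines of Python. This is an illustrative reading of the text, not the BioNumerics implementation used in the study; the function names and the input format (one best similarity score per library unit) are assumptions made for the example.

```python
def enzyme_score(n_different_bands):
    """Score for one enzyme: 100 minus the number of differing bands."""
    return 100.0 - n_different_bands


def similarity(apai_different_bands, asci_different_bands):
    """Final similarity score: average of the ApaI and AscI scores."""
    return (enzyme_score(apai_different_bands)
            + enzyme_score(asci_different_bands)) / 2.0


def assign_clone(best_score_by_clone, sst=97.5, dst=1.0):
    """Return the assigned clone, or None if the isolate is excluded.

    best_score_by_clone: dict mapping each library unit (clone) to the
    best similarity score between the query and any profile in that
    unit (hypothetical input format).
    """
    ranked = sorted(best_score_by_clone.items(),
                    key=lambda item: item[1], reverse=True)
    if not ranked:
        return None
    best_clone, best_score = ranked[0]
    # Step 2: exclude queries that are too distant from every unit.
    if best_score < sst:
        return None
    # Step 3: require the runner-up clone to score at least dst lower.
    if len(ranked) > 1 and best_score - ranked[1][1] < dst:
        return None
    return best_clone


# Illustrative scores only; the clone labels are placeholders.
accepted = assign_clone({"CC1": 99.0, "CC2": 97.0})
ambiguous = assign_clone({"CC1": 99.0, "CC2": 98.5})
too_distant = assign_clone({"CC1": 97.0, "CC2": 90.0})
```

Under these assumptions, `accepted` is `"CC1"`, while `ambiguous` and `too_distant` are both `None`: the first fails the DST criterion and the second fails the SST criterion.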

[Figure: misidentification (%) and non-identification (%) as a function of SST (%), for DST = 1% and DST = 0.5%.]

By using the above algorithm, 709 isolates (9.7%) could not be assigned to a clone. The 6,133 isolates that were assigned to a clone based on their PFGE patterns were added to the 500 isolates of the reference library for which the clone was defined directly by MLST and which met the epidemiological inclusion criteria (food and clinical isolates, ). Therefore, the collection used for the source distribution analysis was composed of 6,633 isolates. In total, they represented 24 singletons and 39 clonal complexes, covering the breadth of clonal diversity of L. monocytogenes.

Contingency table used for the statistical tests performed to identify associations of clones with food and clinical sources

                                                                Tested source    All other sources
Tested CC                                                       # isolates       # isolates
All other CCs (of the species or of the appropriate lineage)    # isolates       # isolates

Phylogenetic analyses based on the core genomes

We first reconstructed a phylogenetic tree with all isolates from lineages I and II based on the 1,791 genes of the core genome. A multiple amino acid sequence alignment was performed for each family separately using MUSCLE v.3.6 (default parameters; ref. 7) and back-translated into a codon-level alignment. The BMGE program v.1.1 was used to select characters well suited for phylogenetic tree reconstruction (BLOSUM30 similarity matrix, gap rate cut-off = 0.20, sliding window size = 3, entropy score cut-off = 0.5; ref. 8). Phylogenetic trees were inferred with RAxML with the evolutionary model GTR + Γ4 + I.

As recombination could affect phylogenetic inference, we performed additional phylogenetic analyses after discarding regions likely to be recombined and homoplastic characters. Three independent tree reconstructions were performed: lineage I, lineage II, and both lineages, each based on its corresponding core gene set. Because the core genome of each separate lineage is composed of more gene families than the core genome of the 104 strains considered together, analyzing each lineage separately is expected to improve the robustness of the phylogenetic analyses. As all strains belonging to the same clone are very closely related according to the 104-strain RAxML phylogenetic tree (see above), we selected only one representative taxon within each clone. Ten taxa were used for lineage I, 17 for lineage II and 27 for lineages I + II. Because recombination rates differ between the two lineages, we adjusted the parameters used for the elimination of recombination and of homoplastic characters to each dataset.

Given a pair of aligned sequences ij within a locus, the number s_ij of observed substitutions (or SNPs) over l_ij aligned characters can be considered to follow a negative binomial distribution NB(p,r), where the parameters p and r depend mainly on the pair ij and on l_ij. Therefore, values of s_ij that are too small or too large (likely caused by homologous recombination events between the two sequences or from a third one, respectively) can be assessed from NB(p,r). This property was used to discard loci likely affected by homologous recombination events in order to infer clonal trees (i.e. trees displaying the clonal evolutionary history of the selected strains).

First, a multiple genome alignment was computed by progressiveMauve (ref. 9), as well as a multiple sequence alignment for each locus.
Second, for each pair of isolates ij within each locus, the corresponding pair of aligned sequences was used to obtain the number of SNPs s_ij and the number of aligned characters l_ij.
Third, for each pair ij and length l_ij, a distribution of observed SNP numbers was computed from the aligned genomes with a sliding window of length l_ij.
Fourth, for each distribution of observed SNP numbers, the associated theoretical NB distribution was estimated (i.e. parameters p and r were estimated to fit the observed distribution).
Fifth, for each pair ij, each observed value s_ij was assessed as being neither too small nor too large under the associated theoretical NB distribution by computing the p-value p_ij = min[P(X < s_ij), P(s_ij < X)], where X ~ NB(p,r); from this definition, every pair ij with p_ij < 5% was considered non-orthologous (i.e. not arising from a clonal process), and every multiple sequence alignment containing at least one pair of non-orthologous sequences was discarded.
Sixth, we also discarded all multiple sequence alignments showing conflicting mosaic patterns between parsimony-informative characters, as assessed by the pairwise homoplasy index (PHI) test (ref. 10; 5% threshold).
Seventh, remaining parsimony-informative characters assessed as homoplastic were discarded with the program Noisy (ref. 11).
Eighth, all remaining multiple sequence alignments (1,588 for lineage I, 924 for lineage II and 1,144 for lineages I + II) were concatenated into a supermatrix of characters that was used to infer a clonal tree with PhyML (ref. 12) with the evolutionary model GTR + Γ4 + I.
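The fourth and fifth steps (fitting a negative binomial distribution and computing the two-sided tail probability) can be illustrated with the following Python sketch. It is not the authors' code. Two choices are assumptions, because the text does not specify them: the parameterization (X counts failures before the r-th success, each trial succeeding with probability p) and the method-of-moments fit. All function names are hypothetical.

```python
import math
from statistics import mean, pvariance


def nb_pmf(x, r, p):
    """P(X = x) for the assumed parameterization of NB(p, r)."""
    log_pmf = (math.lgamma(x + r) - math.lgamma(x + 1) - math.lgamma(r)
               + r * math.log(p) + x * math.log1p(-p))
    return math.exp(log_pmf)


def fit_nb_moments(window_counts):
    """Method-of-moments estimate of (r, p); an assumed fitting method."""
    m = mean(window_counts)
    v = pvariance(window_counts)
    if v <= m:
        raise ValueError("moment fit needs variance greater than mean")
    return m * m / (v - m), m / v


def tail_p_value(s, r, p):
    """p_ij = min[P(X < s), P(s < X)] with strict inequalities."""
    lower = sum(nb_pmf(x, r, p) for x in range(s))  # P(X < s)
    upper = max(1.0 - lower - nb_pmf(s, r, p), 0.0)  # P(X > s)
    return min(lower, upper)


def is_non_orthologous(s, window_counts, alpha=0.05):
    """Flag a pair whose SNP count s is too small or too large."""
    r, p = fit_nb_moments(window_counts)
    return tail_p_value(s, r, p) < alpha


# Illustrative SNP counts from sliding windows (made-up numbers).
windows = [0, 1, 2, 3, 4, 5, 2, 3, 1, 9]
typical = is_non_orthologous(3, windows)
extreme = is_non_orthologous(30, windows)
```

With the strict inequalities written in the text, an observed count of zero always gives a lower tail of zero and is therefore flagged; whether the original analysis used an inclusive tail is not stated, so this sketch follows the formula as written.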

We thus obtained three phylogenetic trees (i.e. lineage I, lineage II and lineages I + II) whose topologies likely display the clonal evolutionary process. For each of these three topologies, branch lengths were refitted from the concatenation of the initial multiple sequence alignments in order to infer real substitution rates across branches. Finally, the branch-refitted phylogenetic trees of lineages I and II were each rooted and regrafted based on the branch-refitted tree inferred from the lineage I + II data.

References for Supplementary Note:

1. Graves, L. M. & Swaminathan, B. PulseNet standardized protocol for subtyping Listeria monocytogenes by macrorestriction and pulsed-field gel electrophoresis. Int J Food Microbiol 65, (2001).
2. Doumith, M., Buchrieser, C., Glaser, P., Jacquet, C. & Martin, P. Differentiation of the major Listeria monocytogenes serovars by multiplex PCR. J Clin Microbiol 42, (2004).
3. Ragon, M. et al. A new perspective on Listeria monocytogenes evolution. PLoS Pathog 4, e (2008).
4. Chenal-Francisque, V. et al. Worldwide distribution of major clones of Listeria monocytogenes. Emerg Infect Dis 17, (2011).
5. Chenal-Francisque, V. et al. Optimized multilocus variable-number tandem-repeat analysis assay and its complementarity with pulsed-field gel electrophoresis and multilocus sequence typing for Listeria monocytogenes clone identification and surveillance. J Clin Microbiol 51, (2013).
6. Cantinelli, T. et al. "Epidemic clones" of Listeria monocytogenes are widespread and ancient clonal groups. J Clin Microbiol 51, (2013).
7. Edgar, R. C. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res 32, (2004).
8. Criscuolo, A. & Gribaldo, S. BMGE (Block Mapping and Gathering with Entropy): a new software for selection of phylogenetic informative regions from multiple sequence alignments. BMC Evol Biol 10, 210 (2010).
9. Darling, A. C., Mau, B., Blattner, F. R. & Perna, N. T. Mauve: multiple alignment of conserved genomic sequence with rearrangements. Genome Res 14, (2004).
10. Bruen, T. C., Philippe, H. & Bryant, D. A simple and robust statistical test for detecting the presence of recombination. Genetics 172, (2006).
11. Dress, A. W. et al. Noisy: identification of problematic columns in multiple sequence alignments. Algorithms Mol Biol 3, 7 (2008).
12. Guindon, S. et al. New algorithms and methods to estimate maximum-likelihood phylogenies: assessing the performance of PhyML 3.0. Syst Biol 59, (2010).