Habitat distribution of GEBA-I genomes.


Supplementary Figure 1. Habitat distribution of GEBA-I genomes. Phyla comprising GEBA-I genomes are depicted clockwise starting at 12 o'clock. Habitat isolation source is depicted from 6 o'clock. Ribbons are colored by phylum, and the width of each ribbon is scaled to the number of genomes in that phylum.

Supplementary Figure 2. Genome size (a) and GC content (b) distribution of GEBA-I genomes per phylum.

Supplementary Figure 3. Distribution of the number of scaffolds per phylum within GEBA-I genomes. Violin plot showing the total number of assembled scaffolds (y-axis) per phylum within GEBA-I genomes (x-axis). White dots represent median counts.

Supplementary Figure 4. Ktedonobacter racemifer histidine kinase cluster (ID: ) displaying a novel domain configuration. IMG gene ID numbers are displayed on the left.

Supplementary Figure 5. Gene duplication in Sphaerochaeta coccoides, as evidenced by tandem repeats of genes (shown in red) coding for alpha-tubulin suppressor. Genes from four distinct regions of the genome are depicted. IMG gene IDs for alpha-tubulin suppressor (A) , , , , ; (B) , , (C) , and (D) , , ,

Supplementary Figure 6. Multiple sequence alignment of Endozoicomonas elysicola pepsin A with eukaryotic pepsins and the recently characterized pepsin homolog from Shewanella denitrificans. The conserved active residues are marked with asterisks, the hydrophobic-hydrophobic-Gly motif is marked by black arrows, and the predicted E. elysicola-specific signal peptide cleavage site is indicated by the blue arrow.

Supplementary Figure 7. Number of biosynthetic clusters (y-axis) in each GEBA-I genome plotted against its genome size in Mbp (x-axis). Individual genomes are colored by phylum.

Supplementary Figure 8. Biosynthetic genes potentially responsible for the production of the phenazine antibiotic pelagiomicin by M. variabilis ATCC

Supplementary Figure 9. Distribution of the percentage of metagenome hits per GEBA-I genome by phylum. Solirubrobacter soli, Ktedonobacter racemifer and Niastella koreensis are the three GEBA-I genomes that recruited the highest percentage of metagenome proteins. Inset: Habitat of metagenome proteins recruited by GEBA-I genomes.

Supplementary Figure 10. Distribution of various classes of CRISPR-Cas systems and CRISPR spacers across phyla. The y-axis on the left shows the proportion of a particular class of CRISPR-Cas system within members of a phylum. The number of spacers (green line) is shown on the y-axis on the right.

Supplementary Note:

Additional example of a case where GEBA-I genomes with closely related reference genomes contribute a large number of novel genes

The minimum 16S rRNA gene distance for Promicromonospora kroppenstedtii (distance = 0.005, Promicromonospora sukumoe) likely underestimates the actual evolutionary distance for this understudied lineage. We found evidence for legitimate differences between orthologous genes, such as the insertion of a homing endonuclease into the primary replicative DNA helicase of P. kroppenstedtii (IMG Gene ID: ) compared to P. sukumoe (IMG Gene ID: ). Other caveats, such as differences in gene prediction or partial genes arising from fragmented assembly of either genome, could contribute to this observation. Alternative explanations for outliers include novel gene fusion events, genome-specific expansions of certain gene families, and horizontal gene transfer.

Additional examples of GEBA-I metagenomic recruiters

Other recruiters from hydrocarbon-impacted environments (e.g., coal, oil, tailings) that might not have been previously recognized include Gemmobacter nectariphilus, Flavobacterium sasangense, Acinetobacter towneri, Thermomonas fusca, and others. Stenotrophomonas maltophilia ATCC (a plant isolate) represents a relatively cosmopolitan species, captured in over 15 samples and in habitats ranging from air to aquatic to rhizosphere; however, this might be an artifact, since S. maltophilia has been reported as a contaminant of laboratory reagents 73.

Detection and classification of CRISPR-Cas systems across GEBA-I genomes

We used sequence similarity methods against the specific curated cas gene models from TIGRFAMs 74,75 to classify the different CRISPR-Cas systems identified across the GEBA-I genomes. Specifically, we searched for the presence of any of the signature genes of the three main CRISPR-Cas types: cas3 (TIGR01587, TIGR01596, TIGR02562, TIGR02621, and TIGR03158), cas9 (TIGR01865 and TIGR03031), and cas10 (TIGR02577 and TIGR02578), for Type I, II, and III, respectively 76 (Supplementary Table 5-A). We also used the proposed signature genes for the two new putative CRISPR-Cas system types, csf1 (TIGR03114) and cpf1 (TIGR04330), for Types IV and V, respectively.

The CRISPR-Cas modules are prokaryotic adaptive immune systems, present in the majority of archaea and in many bacteria, that provide sequence-specific protection against foreign invasive elements (e.g., viruses and plasmids). We explored the presence of this microbial defense mechanism across the GEBA-I genomes and observed that overall 54.6% of them contained at least one of the signature genes for the three main CRISPR types (Types I to III) or the two new putative Type IV and Type V systems (Fig. S8). Almost 70% of the GEBA-I genomes containing CRISPR-Cas systems were classified as Type I, while 16% and 14% could be classified as Type II and Type III, respectively. The new Type IV and Type V systems were present in only 4 and 5 of the GEBA-I genomes, respectively (Supplementary Table 5-A). Both Type I and Type III are widely distributed across different phyla in bacteria and archaea, whereas Type II was found exclusively in bacterial phyla. All these observations are in agreement with previous reports 76,80.
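The signature-gene classification described above can be illustrated with a minimal sketch. It assumes that each genome has already been searched against the TIGRFAM models and reduced to a collection of matching model accessions; the function name and the input format are hypothetical and are not taken from the original analysis. Only the accession-to-type mapping comes from the text.

```python
from typing import Dict, Iterable, List, Set

# Signature cas gene models named in the text, grouped by CRISPR-Cas type.
SIGNATURE_MODELS: Dict[str, Set[str]] = {
    "Type I": {"TIGR01587", "TIGR01596", "TIGR02562", "TIGR02621", "TIGR03158"},  # cas3
    "Type II": {"TIGR01865", "TIGR03031"},  # cas9
    "Type III": {"TIGR02577", "TIGR02578"},  # cas10
    "Type IV": {"TIGR03114"},  # csf1
    "Type V": {"TIGR04330"},  # cpf1
}


def classify_crispr_types(tigrfam_hits: Iterable[str]) -> List[str]:
    """Return every CRISPR-Cas type whose signature model appears in the hits.

    Assumption: `tigrfam_hits` holds the TIGRFAM accessions that matched
    genes in one genome; how those matches were obtained is out of scope.
    """
    hits = set(tigrfam_hits)
    return [
        crispr_type
        for crispr_type, models in SIGNATURE_MODELS.items()
        if hits & models  # at least one signature model is present
    ]


if __name__ == "__main__":
    # Hypothetical genome with a cas3 model and a cas10 model among its hits.
    print(classify_crispr_types(["TIGR01587", "TIGR02577", "TIGR00001"]))
```

In this sketch a genome may be assigned more than one type when signature genes of several types are present; whether the original analysis resolved such cases to a single type is not stated in the text.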

CRISPR spacer identification in GEBA-I genomes

A CRISPR-Cas spacer database of 31,627 sequences was created by running a modified version of the CRISPR Recognition Tool 81 (CRT), detailed in Huntemann et al. 57, against all GEBA-I genomes (Supplementary Table 5-B) and was manually curated to remove ambiguous sequences (those with more than 2 undetermined nucleotides). To search for novel spacer sequences, all identified GEBA-I spacers were queried for matches against a database containing 700 thousand spacers from all isolate genomes deposited in the IMG system 15, using very stringent sequence match thresholds. We used the BLASTn-short function from the BLAST+ package 71 with an e-value threshold of 1.0e-06 and a percent identity of 95%. Additionally, we used the same search parameters to query all GEBA-I spacer sequences against a list of 125,000 metagenomic viral contigs 43 to look for specific host-virus interactions.

After comparing all spacers derived from the GEBA-I genomes with all spacers from isolates in the IMG system, we observed that 91% of the GEBA-I spacers (representing over 28 thousand sequences) were previously unreported (Supplementary Table 5-B). This new spacer information provides insights into the foreign genetic content of microbial genomes. As an example, we compared the GEBA-I spacers against a list of 125 thousand metagenomic viral contigs from Paez-Espino et al. We obtained a list of 497 hits from 221 unique spacers to 301 metagenomic viruses (152 unique metagenomic viruses according to the viral grouping defined in Paez-Espino et al., 2016) (Supplementary Table 5-C), establishing a link between host and virus. Among these host-virus connections, it is relevant to mention that several spacers derive from pathogenic microbial species (e.g., Capnocytophaga ochracea VPI 2845, Prevotella bivia 653, Selenomonas sputigena, and Leptotrichia goodfellowii LB 57) that cause oral and skin infections in humans and animals. Therefore, the discovery of novel spacer sequences from isolate genomes can reveal unreported host-virus connections that could be exploited in biotechnological applications.

Other Notable Genomic Features

Coriobacterium glomerans, isolated from the midgut of a red soldier bug, contains an unusually large fraction of genes dedicated to carbohydrate metabolism and transport. Over 22% of the 2.1 Mb gut symbiont genome is dedicated to carbohydrate metabolism, which is the highest among all sequenced Actinobacteria and the third highest among all sequenced genomes. The large proportion of sugar-utilizing genes in C. glomerans is expected to play a pivotal role in providing essential nutrients to the insect host, which is unable to meet its dietary requirements on its own and displays retarded growth or increased mortality upon removal of its gut symbionts 82,83. Genes belonging to both PTS and ABC type transporters for xylose, fructose, lactose and cellobiose metabolism were identified in the C. glomerans genome; these genes were more similar to Firmicutes genes than to those of other Actinobacteria, suggesting lateral transfer as a possible means of gene acquisition.

Genes for flagellar synthesis, assembly and regulation were detected in the Sphaerobacter thermophilus genome, making it the first identified Gram-positive Chloroflexi to possess genes for a complete flagellar apparatus co-localized on the chromosome. This finding was surprising, since S. thermophilus cells are not known to be motile and flagellar structures have not been detected microscopically 84. Chloroflexi flagellar genes have been reported in the non-GEBA-I, Gram-negative Thermomicrobium roseum 85, where they were detected on a megaplasmid, an indication of possible lateral gene transfer. Additionally, flagellar genes from both genomes do not appear to have homologs among other sequenced Chloroflexi genomes but share high homology with genes from Firmicutes.

Supplementary Tables:

Supplementary Table 1. Summary of 1003 GEBA-I genomes
Supplementary Table 2. Detailed information of GEBA-I only protein clusters (2-A) and singletons (2-B)
Supplementary Table 3. Summary (3-A) and length (3-B) of predicted biosynthetic clusters in GEBA-I genomes
Supplementary Table 4. Top metagenome recruiters among GEBA-I genomes that had over 200 CDS hits at >95% amino acid identity over 70% alignment length to an individual metagenome CDS.
Supplementary Table 5. Classification and distribution of CRISPR-Cas systems across GEBA-I genomes. Signature genes of the three main CRISPR-Cas types, namely cas3, cas9 and cas10 (5-A), GEBA-I CRISPR-Cas spacer database (5-B), and hits from unique GEBA-I spacers to metagenomic viruses (5-C)
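The spacer curation and search steps described in the spacer identification note above can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline used in the study: all function names are hypothetical, any base other than A, C, G or T is treated as undetermined, and the tabular output format (`-outfmt 6`) is an assumption. The e-value and percent identity thresholds and the `blastn-short` task are taken from the text.

```python
from typing import Dict, Iterable, List

MAX_UNDETERMINED = 2  # spacers with more than 2 undetermined nucleotides are removed


def count_undetermined(seq: str) -> int:
    """Count bases other than A, C, G, T (assumed to mark undetermined positions)."""
    return sum(1 for base in seq.upper() if base not in "ACGT")


def curate_spacers(spacers: Dict[str, str]) -> Dict[str, str]:
    """Keep only spacers with at most MAX_UNDETERMINED undetermined nucleotides."""
    return {
        spacer_id: seq
        for spacer_id, seq in spacers.items()
        if count_undetermined(seq) <= MAX_UNDETERMINED
    }


def blastn_short_command(query_fasta: str, database: str, out_path: str) -> List[str]:
    """Build a BLAST+ blastn call with the thresholds quoted in the text."""
    return [
        "blastn", "-task", "blastn-short",
        "-query", query_fasta,
        "-db", database,
        "-evalue", "1e-6",
        "-perc_identity", "95",
        "-outfmt", "6",  # assumption: tabular output, one hit per line
        "-out", out_path,
    ]


def novel_fraction(all_ids: Iterable[str], matched_ids: Iterable[str]) -> float:
    """Fraction of spacers with no match in the reference spacer database."""
    all_set = set(all_ids)
    if not all_set:
        return 0.0
    return len(all_set - set(matched_ids)) / len(all_set)


if __name__ == "__main__":
    # Hypothetical toy input: "s2" carries three undetermined bases and is dropped.
    kept = curate_spacers({"s1": "ACGTACGT", "s2": "ACNNNGT", "s3": "ACNNGT"})
    print(sorted(kept))
    print(" ".join(blastn_short_command("spacers.fa", "img_spacers", "hits.tsv")))
```

The command list can be passed to `subprocess.run` once BLAST+ is installed and a nucleotide database has been built; the file and database names shown are placeholders.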

References (Supplementary Information):

73. Salter, S. J. et al. Reagent and laboratory contamination can critically impact sequence-based microbiome analyses. BMC Biol. 12, 87 (2014).
74. Haft, D. H. et al. TIGRFAMs and Genome Properties in Nucleic Acids Res. 41, D (2013).
75. Haft, D. H., Selengut, J. D. & White, O. The TIGRFAMs database of protein families. Nucleic Acids Res. 31, (2003).
76. Makarova, K. S. et al. An updated evolutionary classification of CRISPR-Cas systems. Nat. Rev. Microbiol. 13, (2015).
77. Barrangou, R. & Marraffini, L. A. CRISPR-Cas systems: Prokaryotes upgrade to adaptive immunity. Mol. Cell 54, (2014).
78. Deveau, H., Garneau, J. E. & Moineau, S. CRISPR/Cas system and its role in phage-bacteria interactions. Annu. Rev. Microbiol. 64, (2010).
79. Marraffini, L. A. & Sontheimer, E. J. CRISPR interference: RNA-directed adaptive immunity in bacteria and archaea. Nat. Rev. Genet. 11, (2010).
80. Makarova, K. S. et al. Evolution and classification of the CRISPR-Cas systems. Nat. Rev. Microbiol. 9, (2011).
81. Bland, C. et al. CRISPR recognition tool (CRT): a tool for automatic detection of clustered regularly interspaced palindromic repeats. BMC Bioinformatics 8, 209 (2007).
82. Haas, F. & König, H. Characterization of an anaerobic symbiont and the associated aerobic bacterial flora of Pyrrhocoris apterus (Heteroptera, Pyrrhocoridae). FEMS Microbiol. Ecol. 45, (1987).
83. Kaltenpoth, M., Winter, S. A. & Kleinhammer, A. Localization and transmission route of Coriobacterium glomerans, the endosymbiont of pyrrhocorid bugs. FEMS Microbiol. Ecol. 69, (2009).
84. Demharter, W., Hensel, R., Smida, J. & Stackebrandt, E. Sphaerobacter thermophilus gen. nov., sp. nov. A Deeply Rooting Member of the Actinomycetes Subdivision Isolated from Thermophilically Treated Sewage Sludge. Syst. Appl. Microbiol. 11, (1989).
85. Wu, D. et al. Complete Genome Sequence of the Aerobic CO-Oxidizing Thermophile Thermomicrobium roseum. PLOS ONE 4, e4207 (2009).