Metagenomic species profiling using universal phylogenetic marker genes

Similar documents
Carl Woese. Used 16S rrna to developed a method to Identify any bacterium, and discovered a novel domain of life

Supplementary Figure S1 A: receiver operating characteristic (ROC) curve plotting

Metagenomics of the Human Intestinal Tract

Introduction to taxonomic analysis of metagenomic amplicon and shotgun data with QIIME. Peter Sterk EBI Metagenomics Course 2014

Bioinformatic tools for metagenomic data analysis

Analysis of Microarray Data

Chromosome-scale scaffolding of de novo genome assemblies based on chromatin interactions. Supplementary Material

Alignment-free d2 oligonucleotide frequency dissimilarity measure improves prediction of hosts from metagenomically-derived viral sequences

A comprehensive assessment of RNA-seq accuracy, reproducibility and information content by the Sequencing Quality Control Consortium

Introduction to Microbial Community Analysis. Tommi Vatanen CS-E Statistical Genetics and Personalised Medicine

New Statistical Algorithms for Monitoring Gene Expression on GeneChip Probe Arrays

OMNIgene GUT stabilizes the microbiome profile at ambient temperature for 60 days and during transport

Functional analysis using EBI Metagenomics

Applying Regression Techniques For Predictive Analytics Paviya George Chemparathy

An introduction into 16S rrna gene sequencing analysis. Stefan Boers

UTAX accurately predicts taxonomy of marker gene sequences

Advisors: Prof. Louis T. Oliphant Computer Science Department, Hiram College.

Post-assembly Data Analysis

Simultaneous profiling of transcriptome and DNA methylome from a single cell

Methods for comparing multiple microbial communities. james robert white, October 1 st, 2007

INTRODUCTION A clear cultivation bias exists in microbial phylogenetics. As of 2010, half of all

Post-assembly Data Analysis

Phylogenetic methods for taxonomic profiling

Nature Genetics: doi: /ng Supplementary Figure 1

Nature Biotechnology: doi: /nbt Supplementary Figure 1

Infectious Disease Omics

Chang Xu Mohammad R Nezami Ranjbar Zhong Wu John DiCarlo Yexun Wang

A comparison of methods for differential expression analysis of RNA-seq data

Introduction to gene expression microarray data analysis

Outline. Analysis of Microarray Data. Most important design question. General experimental issues

Microbially Mediated Plant Salt Tolerance and Microbiome based Solutions for Saline Agriculture

Measurement and sampling

Nature Methods: doi: /nmeth.4396

Attachment 1. Categorical Summary of BMP Performance Data for Solids (TSS, TDS, and Turbidity) Contained in the International Stormwater BMP Database

Assigning Sequences to Taxa CMSC828G

Recent urbanization in China is correlated with a Westernized microbiome encoding increased virulence and antibiotic resistance genes

Rapidly regulated genes are intron poor

SHAMAN : SHiny Application for Metagenomic ANalysis

20. Alternative Ranking Using Criterium Decision Plus (Multi-attribute Utility Analysis)

Exploring Similarities of Conserved Domains/Motifs

Package GeneExpressionSignature

Appendix 5. Fox River Study Group Interim Monitoring Evaluation

Discover the Microbes Within: The Wolbachia Project. Bioinformatics Lab

Effective Applications Development, Maintenance and Support Benchmarking. John Ogilvie CEO ISBSG 7 March 2017

Oligonucleotide microarray data are not normally distributed

Applications of Next Generation Sequencing in Metagenomics Studies

PREDICTING PREVENTABLE ADVERSE EVENTS USING INTEGRATED SYSTEMS PHARMACOLOGY

Soil - Plasticity 2017 (72) PROFICIENCY TESTING PROGRAM REPORT

MicroRNA sequencing (mirnaseq)

RNA spike-in controls & analysis methods for trustworthy genome-scale measurements

A Propagation-based Algorithm for Inferring Gene-Disease Associations

Supplementary Fig. 1 related to Fig. 1 Clinical relevance of lncrna candidate

Integrated Benchmarking Moves from Uncertainty to Solutions

Gene-Level Analysis of Exon Array Data using Partek Genomics Suite 6.6

Cell Signaling Technology

a-dB. Code assigned:

RNA-Seq analysis using R: Differential expression and transcriptome assembly

Exploration, Normalization, Summaries, and Software for Affymetrix Probe Level Data

Predictive Modeling Using SAS Visual Statistics: Beyond the Prediction

Gene Expression Data Analysis (I)

Choon-Kong Yap 1, Birgit Eisenhaber 1, Frank Eisenhaber 1,2* and Wing-Cheong Wong 1*

Leveraging biological replicates to improve analysis in ChIP-seq experiments

Evaluation Report of

Contents 16S rrna SEQUENCING DATA ANALYSIS TUTORIAL WITH QIIME... 5

Supplementary Information: Consistent negative response of US crops to high temperatures in observations and crop models

Gene Prediction Chengwei Luo, Amanda McCook, Nadeem Bulsara, Phillip Lee, Neha Gupta, and Divya Anjan Kumar

UK GENDER PAY GAP REPORT 2018

Highly Confident Peptide Mapping of Protein Digests Using Agilent LC/Q TOFs

Nature Biotechnology: doi: /nbt Supplementary Figure 1. Number and length distributions of the inferred fosmids.

Microbial Diversity and Assessment (III) Spring, 2007 Guangyi Wang, Ph.D. POST103B

How to view Results with. Proteomics Shared Resource

Probe-Level Data Normalisation: RMA and GC-RMA Sam Robson Images courtesy of Neil Ward, European Application Engineer, Agilent Technologies.

Chapter 5 Evaluating Classification & Predictive Performance

FEATURE-LEVEL EXPLORATION OF THE CHOE ET AL. AFFYMETRIX GENECHIP CONTROL DATASET

Sample Energy Benchmarking Report

(SHOTGUN) METAGENOMICS. Hélène Touzet, CNRS, CRIStAL

less sensitive than RNA-seq but more robust analysis pipelines expensive but quantitiatve standard but typically not high throughput

Gene Prediction Chengwei Luo, Amanda McCook, Nadeem Bulsara, Phillip Lee, Neha Gupta, and Divya Anjan Kumar

Expression summarization

DiScRIBinATE: a rapid method for accurate taxonomic classification of metagenomic sequences

Compendium of Immune Signatures Identifies Conserved and Species-Specific Biology in Response to Inflammation

Morningstar Direct SM Scorecard

Variant calling in NGS experiments

Optimization-Based Peptide Mass Fingerprinting for Protein Mixture Identification

Introduction to OTU Clustering. Susan Huse August 4, 2016

Welcome to Module 2. In this module, we will be providing an overview of the TransCelerate Methodology process and toolkit, comparing off-site,

Microarray Analysis of Gene Expression in Huntington's Disease Peripheral Blood - a Platform Comparison. CodeLink compatible

Human Microbiome Project: First Map of the World Within Us. Hsin-Jung Joyce Wu "Microbiota and man: the story about us

Exploratory Data Analysis

Axiom mydesign Custom Array design guide for human genotyping applications

Nature Methods: doi: /nmeth Supplementary Figure 1

Computational aspects of ncrna research. Mihaela Zavolan Biozentrum, Basel Swiss Institute of Bioinformatics

De Novo Assembly of High-throughput Short Read Sequences

Retail Sales Forecasting at Walmart. Brian Seaman WalmartLabs

Unravelling Airbnb Predicting Price for New Listing

Lees J.A., Vehkala M. et al., 2016 In Review

CS 5984: Application of Basic Clustering Algorithms to Find Expression Modules in Cancer

A rapid marker ordering approach for high density genetic linkage maps in experimental autotetraploid populations using multidimensional scaling

Supplementary Figure 1. Comparisons of actual tree-ring chronologies (solid line) with

RNA Sequencing Analyses & Mapping Uncertainty

Transcription:

Metagenomic species profiling using universal phylogenetic marker genes Shinichi Sunagawa, Daniel R. Mende, Georg Zeller, Fernando Izquierdo-Carrasco, Simon A. Berger, Jens Roat Kultima, Luis Pedro Coelho, Manimozhiyan Arumugam, Julien Tap, Henrik Bjørn Nielsen, Simon Rasmussen, Søren Brunak, Oluf Pedersen, Francisco Guarner, Willem M. de Vos, Jun Wang, Junhua Li, Joël Doré, S. Dusko Ehrlich, Alexandros Stamatakis, Peer Bork Supplementary Item Supplementary Figure 1 Supplementary Figure 2 Supplementary Figure 3 Supplementary Figure 4 Supplementary Figure 5 Supplementary Figure 6 Supplementary Figure 7 Supplementary Figure 8 Supplementary Table 4 Supplementary Table 5 Supplementary Table 7 Title Accuracy of species-level motu clustering. Linkage of motus of common origin. False-discovery rate over the process of the motu linkage procedure. Abundance consistency within motu linkage groups. GC-content consistency within motu linkage groups. Abundance of motu linkage groups across samples. Ranked abundance and prevalence of motu-lgs Performance of relative species abundance estimations. Sequence identity cutoffs used for clustering marker genes into motus and ambiguous alignment rates for motus of each marker gene. Phylogenomic representation of motus. Summary of using different distance metrics for community similarity analysis.

Supplementary Figure 1: Accuracy of species-level motu clustering. Clustering accuracy was assessed by testing whether motus with at least two MGs or 16S rrna genes that originated from a type strain reference genome were consistent regarding the taxonomic annotation of their members at the species level according to the NCBI taxonomy. According to this information, false discovery rates (FDRs) and recall values were calculated for all motus. Precision (1 - FDR) and recall can take values between 0 and 1, with high values indicating a good agreement of motu cluster members with the NCBI taxonomy.

Proportion of motu LGs 1.0 0.8 0.6 0.4 Taxonomic consistency of motu-lgs completely consistent majority consistent inconsistent 0.2 0.0 Kingdom Phylum Class Order Family Genus Species Species cluster [Mende 2013] Supplementary Figure 2: Linkage of motus of common origin. Accuracy of linking motus to motu linkage groups of common origin was assessed by calculating the consistency of taxonomic information available for motu linkage groups (motu-lgs) whose members originated from annotated motu RefMeta. motu-lgs in which more than 50% of individual motu taxonomic annotations agreed were labelled as "majority consistent" (see legend). The last column shows results for species-levels clusters described in Mende et al., 2013.

Linking FDR (FP / (TP+FP)) 0.000 0.005 0.010 0.015 0.020 0.025 motu-lgs resulted from early stopping after 40,889 steps, FDR = 0.01 0 20,000 40,000 60,000 80,000 100,000 Steps of the motu linking procedure Supplementary Figure 3: False-discovery rate over the process of the motu linkage procedure. The false discovery rate (FDR) is defined as the proportion of false positives (FP) among all positive predictions, that is the sum of FP and true positives (TP). Whenever two motus that are taxonomically annotated were linked by the algorithm, it was evaluated whether annotations agree (TP) or not (FP) and the FDR was updated accordingly. Before the FDR exceeded 0.01 (vertical line) after 40,889 linking steps, motu linkage was terminated.

Range of relative abundances of motu-lg members 1E 6 1E 5 1E 4 0.001 0.01 0.1 1 Full range (all motu-lg members) Inter-quartile range (least extreme 50% of members) 1E 6 1E 5 1E 4 0.001 0.01 0.1 1 Mean relative abundance of each motu-lg Supplementary Figure 4: Abundance consistency within motu linkage groups. Relative abundance estimates of individual motus plotted as a function of the relative abundance of the motu linkage group they belong to (estimated by their median). The full range across all linkage group members as well as the inter-quartile range are shown for each motu linkage group (see legend). Note that in most cases the abundance estimates of individual cluster members are very close to the cluster estimate (median) indicating robustness of these estimates. Note further that motu abundances were not used for cluster construction, only their correlation across samples (Spearman correlation, which is transformationinvariant).

GC-content range of motu-lg members 0.2 0.3 0.4 0.5 0.6 0.7 Full range (all motu-lg members) Inter-quartile range (least extreme 50% of members) 0.2 0.3 0.4 0.5 0.6 0.7 Median GC content of each motu-lg Supplementary Figure 5: GC-content consistency within motu linkage groups. GC-content of individual motus plotted as a function of the mean GC content of the motu linkage group that contains them. Shown are full ranges as well as inter-quartile ranges (see legend). Homogeneity of GC content within a motu linkage group suggests that the contained genes likely belong to the same species, because within a genome and species, GC content generally varies much less than between different species.

1.0 Total relative abundance of motu LGs 0.8 0.6 0.4 0.2 0.0 COG0012 COG0016 COG0018 COG0172 COG0215 COG0495 COG0525 COG0533 COG0541 COG0552 mean Supplementary Figure 6: Abundance of motu linkage groups across samples. Total relative abundance of motus belonging to motu linkage groups summarized across samples by boxplots (thick line corresponds to the median, boxes delineate the inter-quartile range (IQR) with whiskers extending to 1.5-fold IQRs, and circles indicate outliers). First ten columns show total relative abundance separately for each marker gene, while the rightmost column summarizes the distribution of per-sample averages.

Supplementary Figure 7: Ranked abundance and prevalence of motu-lgs. Relative abundances of the 30 most dominant (ranked by mean sample abundance), annotated and novel motu linkage groups (Supplementary Table 6) that represent prokaryotic species clusters are shown as blue and red boxplots, respectively. Grey bar plots show the prevalence of these species clusters in 252 human gut metagenomic samples.performance of relative species abundance estimations.

Supplementary Figure 8: Performance of relative species abundance estimations. Observed relative species abundances using motu-lgs and MetaPhlAn (species-level) were benchmarked against expected values based on a simulated human faecal metagenome. Pearson correlations are shown for logtransformed relative abundances.

Supplementary Table 4: Sequence identity cutoffs used for clustering marker genes into motus and ambiguous alignment rates for motus of each marker gene. For each of 40 universal single-copy marker genes (MGs), sequence identity cutoffs for clustering into motus are shown. Based on motu abundance profiles of 252 metagenomic samples (Supplementary Table 2) we calculated for each MG the fraction of reads that were mapped to more than one motu (ambiguous alignment rates). MGs were selected for inclusion in this study if the ambiguous alignment rate was below 7% and the false discovery rate (Supplementary Table 3) below 10%. FDR - false discovery rate. Summary COG Mean concatenated length in 3496 genomes Mean identification FDR (%) 10 selected MGs 15529 1.42 3.5% Selected genes Marker Gene Mean length in 3496 genomes Calibrated clustering cutoffs (%) Mean ambiguous alignment rate (median of 252 samples) Identification FDR (%) COG0012 1099 94.8 1.06 1.0% COG0016 1058 95.8 0.14 3.4% COG0018 1721 94.2 3.22 1.5% COG0172 1285 94.4 3.79 3.6% COG0215 1415 95.4 2.74 6.4% COG0495 2571 96.4 1.87 5.5% COG0525 2722 95.3 0.72 5.2% COG0533 1054 93.1 0.43 0.9% COG0541 1415 96.1 0.12 5.4% COG0552 1189 94.5 0.12 2.5% Excluded genes Marker Gene Mean length in 3496 genomes Calibrated clustering cutoffs (%) Identification FDR (%) COG0048 391 98.4 0.18 30.1% COG0049 476 98.7 0.15 18.5% COG0052 782 97.2 0.20 10.1% COG0080 433 98.6 0.43 26.8% COG0081 696 98 0.23 15.7% COG0085 3828 97 48.00 6.1% COG0087 662 99 0.23 31.8% COG0088 639 99 0.20 41.4% COG0090 825 98.8 0.17 29.6% COG0091 370 99.2 0.58 42.3% COG0092 712 99.2 0.23 47.0% COG0093 370 99 0.29 42.8% COG0094 547 99 0.34 42.0% COG0096 397 98.6 0.09 29.6% COG0097 538 98.4 0.14 23.3% COG0098 538 98.7 0.14 33.2% COG0099 372 98.9 0.12 27.8% COG0100 393 99 0.12 46.5% COG0102 444 99.1 1.63 46.0% COG0103 415 98.4 0.32 45.2% COG0124 1306 94.5 9.67 3.0% COG0184 278 98.2 0.44 26.1% COG0185 282 99.3 0.20 51.4% COG0186 268 99.3 0.35 58.4% COG0197 425 99.3 0.12 44.6% COG0200 446 98.4 0.14 25.3% COG0201 1322 97.2 1.96 7.9% COG0202 977 98.4 1.67 10.6% COG0256 369 99 0.12 43.3% COG0522 610 98.6 2.55 26.5% 16S rdna 1452 98.8 NA 41.09% Ambiguous alignment rate (median of 252 samples) Ambiguous alignment rate (median of 252 samples)

Supplementary Table 5: Phylogenomic representation of motus. motus were defined as motu(refmeta) or motu(meta) depending on whether genes within a motu cluster were identical or similar (below the MG-specific sequence identity cutoff shown in Supplementary Table 4) to a reference marker gene sequence or only found in a metagenome, respectively. The number of motus and the relative abundance per sample was calculated and broken down by motu type (RefMeta or Meta). detected in 252 fecal samples motu (Meta) clusters fraction of detected motus (%) relative abundance (%) motu (RefMeta) clusters fraction of detected motus (%) relative abundance (%) COG All motus COG0012 2126 739 437 59.1 42.7 302 40.9 57.3 COG0016 2152 748 433 57.9 42.9 315 42.1 57.1 COG0018 2069 610 386 63.3 40.8 224 36.7 59.2 COG0172 2098 716 417 58.2 43.3 299 41.8 56.7 COG0215 2193 757 433 57.2 45.8 324 42.8 54.2 COG0495 2192 669 382 57.1 41.1 287 42.9 58.9 COG0525 2131 670 370 55.2 42.7 300 44.8 57.3 COG0533 2087 668 392 58.7 42.5 276 41.3 57.5 COG0541 2143 715 403 56.4 43.2 312 43.6 56.8 COG0552 2151 718 412 57.4 42.6 306 42.6 57.5 mean 2134 701 407 58.0 42.7 295 42.0 57.3 stdev 41 46 24 2.2 1.3 28 2.2 1.3 Supplementary Table 7: Summary of using different distance metrics for community similarity analysis. Abundance data were normalized by total abundance and a subset was additionally log10. Six different distance metrics were used to calculate community similarities across 249 samples including 88 samples from 41 individuals that were sampled more than once. For these 88 samples, we calculated the percentage of instances when the most similar sample from one individual was collected from the same individual at a different time point. Distance metric Log transformed relative abundances? motu-linkage groups (%) MetaPhlAn (%) RefMG (%) Mean across methods (%) Euclidean Yes 97.7 86.4 92.0 92.0 Horn-Morisita Yes 97.7 87.5 89.8 91.7 Bray-Curtis Yes 98.9 81.8 85.2 88.6 Jensen-Shannon Yes 97.7 80.7 84.1 87.5 Gower Yes 95.5 76.1 75.0 82.2 Spearman No 85.2 77.3 67.0 76.5 Spearman Yes 85.2 77.3 67.0 76.5 Jensen-Shannon No 80.7 67.0 70.5 72.7 Bray-Curtis No 68.2 51.1 47.7 55.7 Gower No 62.5 47.7 45.5 51.9 Horn-Morisita No 39.8 36.4 23.9 33.3 Euclidean No 38.6 33.0 26.1 32.6