Tutorial Estimating taxonomic and functional diversity from a set of mammal gut samples

Objective: This tutorial will introduce you to the basic steps in calculating and comparing the taxonomic and functional diversity of a number of samples, using the 16S rRNA gene as a proxy for biodiversity and metagenomic data to assess function.

Recall that alpha-diversity expresses the diversity of a single sample, while beta-diversity is a measure of the dissimilarity between a pair of samples. In both cases, "traditional" or "nonphylogenetic" diversity measures assign entities to discrete groups defined at some threshold, with no consideration of the relatedness of entities within or between groups. Phylogenetic diversity measures consider the relatedness of entities at all levels, and do not require assignment to specific groups (although clustering may still be carried out for practical reasons).

- In Part I we will map 16S sequences to taxonomic groups, which will allow us to compare the relative abundance of groups across different types of sample.
- In Part II we will cluster sequences into operational taxonomic units (OTUs) and assess whether different types of sample tend to be more or less diverse.
- In Part III we will compute the phylogenetic beta-diversity of our samples to see whether samples of a particular type tend to be more similar to one another than to other types of sample.
- In Part IV we will consider the metagenomes of each sample to determine whether significant functional differences exist.

About the Dataset

The microbes present in a wide range of mammals were investigated in a recent paper:

Muegge BD, Kuczynski J, Knights D, Clemente JC, González A, Fontana L, Henrissat B, Knight R, Gordon JI (2011) Diet drives convergence in gut microbiome functions across mammalian phylogeny and within humans.
Science 332: PMID

The authors of this study collected sequences from 33 mammalian species representing a wide range of taxonomic orders and digestive physiologies. Both 16S profiling (based on the V2 hypervariable region) and whole-genome shotgun sequencing were used to investigate the relationship between taxonomy, function, diet and physiology. The data (both 16S and WGS) were obtained from the MG-RAST server and processed using the mothur 1.28 package of Patrick Schloss (Schloss et al., 2009).

Although we will be investigating similar questions to those of the original authors and starting with their original quality-filtered data set, the workflow I use differs from theirs in a few ways. Critically, I used mothur instead of QIIME (Caporaso et al., 2010), and did not use an alignment mask to delete hypervariable columns from the 16S alignment. This means that our sequences will be more dissimilar than those used in the original paper, and we will have more OTUs at 97% or any other level of identity. I also ran a second pass for chimeras in mothur, which removed a few thousand more sequences from the set. Sadly, for technical reasons one of the most interesting animals (Echidna, a carnivore and the only monotreme in the data set) was left out of the diversity analysis.

The MG-RAST server (Meyer et al., 2008) has some nice functions to let you browse datasets, carry out analyses, and visualize results. In particular, if you click through any of the samples, you will get a wide range of taxonomic, alpha-diversity and other summaries. I encourage you to have a look and try some things out (some features require you to create an account), but the tutorial does not depend on use of the website.

Due to time and computing constraints, some of the following steps have been carried out in advance. However, if you would like to do the whole thing yourself, I have indicated the commands you can use to recreate the analysis.

Part Zero: First Steps (the 16S story)

The sequences were acquired from the appropriate project page at MG-RAST. The Schloss standard operating procedure is a useful guide to analysis; the sequences were already past the quality-checking stage, so the quality-checking steps did not need to be carried out. I carried out the remaining steps beforehand to save you the tedium of entering several prescribed commands, which allows us to skip to the important part. A couple of the steps are also computationally intensive, so you're spared the need to work through the night.
Here is a brief summary of what I did:

(1) Concatenate all the sequence files into a single, large fasta-format file.
(2) Align the sequences using a reference template (in this case the SILVA full-length bacterial template).
(3) Trim the alignment using 'screen.seqs' to eliminate reads that ended up in the wrong part of the alignment.
(4) Screen chimeras using UCHIME (chimera.uchime). This yielded the canonical alignment and set of sequences that will be used in several ways.
(5) Compute sequence distances based on the alignment, ignoring gaps.
(6) Cluster sequences into OTUs using the 'nearest neighbor' approach. These OTUs serve as the basis for both the alpha-diversity analysis (Part II) and the non-phylogenetic beta-diversity analysis (Part III).
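Steps (5) and (6) can be made concrete with a toy example. The following Python sketch is not how mothur implements them. It assumes a handful of already-aligned sequences, a distance defined as the fraction of mismatches over columns where neither sequence has a gap, and a 0.03 cutoff (97% identity). The function names and the toy sequences are hypothetical.

```python
def distance(a, b):
    """Fraction of mismatches over aligned columns where neither sequence has a gap."""
    compared = mismatches = 0
    for x, y in zip(a, b):
        if x in "-." or y in "-.":
            continue  # ignore gapped columns
        compared += 1
        mismatches += x != y
    return mismatches / compared if compared else 1.0


def nearest_neighbor_otus(seqs, cutoff=0.03):
    """Single-linkage (nearest neighbor) clustering: a sequence joins an OTU
    if it is within the cutoff of ANY member of that OTU."""
    names = list(seqs)
    parent = {n: n for n in names}

    def find(n):
        # Union-find lookup with path compression.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if distance(seqs[a], seqs[b]) <= cutoff:
                parent[find(a)] = find(b)

    otus = {}
    for n in names:
        otus.setdefault(find(n), []).append(n)
    return sorted(otus.values(), key=len, reverse=True)


# Toy alignment (hypothetical data): 40 columns each.
base = "ACGT" * 10
s2 = "T" + base[1:]            # one mismatch relative to base (2.5%)
s3 = s2[:5] + "G" + s2[6:]     # one further mismatch (5% from base)
seqs = {"s1": base, "s2": s2, "s3": s3, "s4": "----" + "TTTT" * 9}

otus = nearest_neighbor_otus(seqs)
print(otus)
```

In this toy case s1 and s3 differ by 5% yet land in the same OTU, because each is within 3% of s2. This chaining is characteristic of nearest-neighbor clustering.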

Part I: Relative abundance of groups

In Part I we will assign taxonomic information to our 16S sequences using the Ribosomal Database Project classifier. In brief, this method summarizes each reference 16S sequence (with known taxonomy) as a series of 'words' or k-mers of a particular length to model the composition of the reference database. Each read is then summarized in the same way, and its profile is compared against the reference database to find the best match. The comparison is performed using a Naive Bayes approach, and bootstrapping is used to generate confidence scores between 0 and 100. In general, scores below 60 are not to be trusted, and even higher scores can miss the mark. Since we are classifying short fragments of a full-length gene, we don't have a huge amount of information to go on and can expect many assignments to be of low confidence.

The reference database I used is the SILVA bacterial sequence set, with a total of 14,956 sequences. Assignment was performed with the 'classify.seqs' command, using a fairly stringent confidence cutoff of 80%.

The percentage of sequences from each sample that were classified to the genus and phylum level using this approach can be summarized as follows. In most cases, fewer than 25% of sequences were classified at the genus level, but over 70% of sequences were classified at the phylum level, with the exception of Goeldi's marmoset (an omnivore).

Very high assignment at the genus level was obtained for black bear, black lemur, polar bear, spectacled bear, and squirrel; in most cases these samples were dominated almost entirely by a single genus. Looking at the taxonomic profiles gives some crazy-looking results, but these seem to be consistent with Table S2 from the original paper, which shows very few OTUs and low diversity for these samples.

Part I activity - statistical analysis of taxonomic distributions

You will load taxonomic information into STAMP (Parks and Beiko, 2010) to perform comparisons of abundance between samples and groups of samples. There are three relevant files:

- MammalMetadata.txt includes information about each sample, including name, diet and digestive type.
- phylum.tsv summarizes taxonomic information by sample at the phylum level.
- genus.tsv gives the same information at the genus level.

To load information into STAMP, first open the program, then choose "load data" from the File menu. "Profile file" will be either the genus or phylum-level information, while "Group metadata file" is MammalMetadata.txt. We can talk about how to use STAMP in the session, but the basic options can be seen on the screen in the Properties frame:

- Comparison type tab: Multiple groups, two groups or two samples.
- Profile (for two groups or two samples): which samples will be compared?
- Statistical properties: choose the type of statistical test and multiple test correction (including None).
- Filtering: show only features that satisfy a particular criterion?

You should see a small tab right under the default PCA plot that appears. The best way to visualize differences between individual things (taxonomic groups, OTUs, or functions) is to choose the 'post-hoc plot' option. This will show you a list of features just to the right of the initially empty plot; pick any category to see the plots. You can experiment by changing the type of test and the choice of plot.

Consider the following questions:

- Do different diet types or physiologies tend to form clusters when PCA is applied?
- Are there any significant differences between foregut and hindgut-fermenting carnivores?
- Which genera or phyla tend to distinguish different groups?
- Do duplicate members of the same species have similar taxonomic profiles?
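To make the multiple test correction option a little more concrete, here is a minimal Python sketch of one widely used correction, the Benjamini-Hochberg false discovery rate procedure. This is an illustration only: the function name and the example p-values are made up, and STAMP's own implementation may differ in detail.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values, in the original order."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical p-values for four features (e.g. four phyla).
raw = [0.01, 0.04, 0.03, 0.5]
adj = benjamini_hochberg(raw)
for p, q in zip(raw, adj):
    print(f"raw p = {p:.3f}   adjusted = {q:.4f}")
```

With no correction, three of these four features fall below 0.05; after correction only the smallest does, which is why the choice of correction can change which features STAMP reports as significant.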

Part II: Alpha diversity of OTUs

From Part Zero above we have a list of OTUs and a mapping of every non-chimeric sequence to an OTU. In the hope that we see some recurring sequences, we can plot the abundance of OTUs of different sizes. A plot of the number of OTUs containing various numbers of sequences shows that there are 16,344 OTUs with one sequence (singleton OTUs) and one OTU with 14,791 sequences: a beautiful almost-symmetry if there ever was one. Those singletons could represent a few things: unfiltered chimeras, misaligned sequences and sequencing errors (although the 3% OTU threshold should suck up a lot of these into larger groups). Less pessimistically, many of these may be rare organisms that happened to be taking a tour of the gut when the Feces Express came barreling through. These can potentially have a huge impact on calculated diversity, depending on what calculations we use.

Our starting file is "mammals.otutable.txt", which summarizes the count of each OTU in each sample. Since the number of inferred OTUs is large, the file is too wide to open completely in Excel, but the most abundant OTUs come first so these are not cut off.

Pretty much the simplest measure we can consider is taxon richness, the count of distinct things (here, OTUs) in a given setting. Of course this measure is going to depend on our sampling effort, and it is very unlikely that we will observe everything. So we need some sort of equation to estimate the richness based on our sampling effort. One way to do this is with the Chao1 statistic, which considers the number of "rare" things (singleton and doubleton OTUs) in building an estimate of how many things are really there.

Part II activity

Step 1: Let's fire up mothur to take a closer look at the OTU table. mothur commands all follow the basic pattern

    command(option1=moo,option2=baa,option3=woop)

and are pretty straightforward. Be forewarned that mothur generates a ton of files with very long names. Try this command:

    mothur > collect.single(shared=mammals.otutable.txt,calc=chao,freq=10)

which will show the effect of increasing sample size on estimated richness. This command generates a file for each sample of the form "mammals.otutable.capybara.chao", with four columns representing the sequencing effort, the estimated number of OTUs, and the lower and upper bounds of the confidence interval on that estimate, based on the assumption of a lognormal distribution. We can compare two samples by plotting a line graph in (say) Excel, including the bounds to determine whether the confidence intervals overlap. Give this a try for any pair of samples.

- Based on our estimates of richness, do you think we are close to sampling to saturation? What might happen if we mask hypervariable alignment columns?
- Do your two samples have different richness?

If we want to compare many samples, we should choose a sampling effort equal to or less than that of the smallest sample (unless we want to lose samples). Which sample is dragging us down in this study? In this case, the goat is the hyena, with just over 1600 sequences. So let's grab the 1600 line from each file using the following UNIX command:

    grep -w '^1600' *.chao | sort -k2 -n

The -w flag keeps grep from also matching larger sampling efforts that merely begin with 1600 (such as 16000), and the pipe passes the matching lines to sort, which orders them numerically by the second column. Feel free to ask what this command is actually doing if you're not sure. You can either copy and paste the output or redirect it to a file for handy comparison in a graphing program. Take a look and see if there are any interesting trends, outliers, and so on.

- Which animals have the lowest expected richness? Are you surprised by this?
- Who has the highest expected richness?
- Is there any consistency by diet?

Richness only considers the count, and not the relative abundance (i.e., the evenness) of OTUs in the sample. Let's try the above again using Shannon diversity instead of Chao1:

    mothur > collect.single(shared=mammals.otutable.txt,calc=shannon)

We can examine and compare results in a similar manner as for the Chao1 statistic.
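For intuition about what these two calculators measure, here is a small Python sketch. It assumes one common bias-corrected form of Chao1, S_obs + F1(F1 - 1) / (2(F2 + 1)), where F1 and F2 are the numbers of singleton and doubleton OTUs, and the usual Shannon index, H = -sum(p ln p). The function names and OTU counts are hypothetical, and mothur's exact estimators and confidence intervals may differ.

```python
import math


def chao1(counts):
    """Bias-corrected Chao1 richness estimate from a list of OTU counts."""
    counts = [c for c in counts if c > 0]
    s_obs = len(counts)                       # observed OTUs
    f1 = sum(1 for c in counts if c == 1)     # singleton OTUs
    f2 = sum(1 for c in counts if c == 2)     # doubleton OTUs
    return s_obs + f1 * (f1 - 1) / (2 * (f2 + 1))


def shannon(counts):
    """Shannon diversity (natural log) from a list of OTU counts."""
    counts = [c for c in counts if c > 0]
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts)


# Hypothetical sample: 9 observed OTUs, 4 singletons, 2 doubletons.
sample = [50, 20, 5, 2, 2, 1, 1, 1, 1]
print("Chao1:", chao1(sample))
print("Shannon:", round(shannon(sample), 3))

# Four extra singletons (for instance, sequencing errors) change the picture.
noisy = sample + [1, 1, 1, 1]
print("Chao1 with extra singletons:", round(chao1(noisy), 2))
print("Shannon with extra singletons:", round(shannon(noisy), 3))
```

Comparing the two pairs of numbers shows how strongly each measure responds to a handful of additional rare OTUs, which is worth keeping in mind for the questions that follow.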

- Are the Shannon estimates more or less robust to sampling effort as compared with Chao1? Why or why not?
- Is there any substantial change in the ordering of animals when we switch from richness to diversity?

Part III: Sample beta diversity

Taxonomic summaries can be useful because they put names to sequences. OTUs are less arbitrarily defined than taxonomic groups, reflecting (sort of) a fixed amount of 16S sequence diversity. But different OTUs are related to different degrees, and treating OTUs as completely discrete entities does not take this relatedness into account. The best way to get around this is through phylogenetic diversity, which can consider degrees of relatedness between sequences (while potentially using representatives from OTUs to simplify the analysis).

Here we will consider beta diversity, the dissimilarity between different samples. One interesting feature of beta-diversity that was explored by my former student Donovan Parks is the similarity between certain phylogenetic and non-phylogenetic measures. Here we will compare the patterns we observe when we use non-phylogenetic (Bray-Curtis) and phylogenetic (normalized weighted UniFrac) methods to infer beta-diversity. We will use Donovan's Express Beta Diversity (EBD) software (also the basis for Parks and Beiko, 2013) to calculate both of these measures.

Again I did some precooking to save you time. First, I used the NAST alignment as the basis for construction of a phylogenetic tree using FastTree (Price et al., 2010). It would be a mistake to use this tree as any sort of proxy for prokaryotic evolution (we can discuss the many reasons why) but it should suffice for our diversity calculations. Second, I built the input files that EBD requires to associate leaves in the tree with sample counts.

In the nonphylogenetic case, we again need to define groups and will use the same 97% OTUs that were generated above.
Since Bray-Curtis is a quantitative measure that takes into account the relative abundance of different OTUs, the huge number of singleton OTUs should not be a problem, although distances may nonetheless be large (i.e., close to 1.0). The formula for Bray-Curtis dissimilarity between any two samples i and j is:

\[ d_{ij} = \frac{\sum_k |x_{ik} - x_{jk}|}{\sum_k (x_{ik} + x_{jk})} \]

where each k is a different OTU, and x_{ik} is the count of OTU k in sample i. If x_{ik} = x_{jk} for all k, then the dissimilarity will be zero.

The file "mammal.otu.counts" contains a formatted list of all counts. Use this file as input to EBD with the following command:

    ExpressBetaDiversity -s mammal.otu.counts -d bc.dst -c Bray-Curtis

This will generate a PHYLIP-formatted distance matrix that can be used by mothur to perform ordination and hierarchical clustering. Take a look at the entries in the distance matrix: they are all very large! In fact, our friends from above, the black bear and spectacled bear, have dissimilarities of 1.0 with several other groups: they share no OTUs in common, at least not within the limit of precision.

Run mothur again, and we can perform a principal coordinate analysis of this distance matrix to see whether clusters emerge. Use the command

    mothur > pcoa(phylip=bc.dst)

and you will get an output file ending in ".pcoa.axes" that gives the values for each sample in each principal coordinate (mothur prints the exact file name when the command finishes). These coordinates can be viewed in Excel, R, etc. If you want to use R, you can use the following set of commands, adjusting the file name if mothur reported a different one:

    R> axes = read.table("bc.pcoa.axes", header=TRUE)
    R> attach(axes)
    R> plot(axis1,axis2)
    R> text(axis1,axis2,labels=group)

This will show you a scatterplot of the first two axes, with labels corresponding to animal names. You can try other axes as well, but the first two explain the greatest amount of variation in the distance matrix.

- Are any trends evident from this plot? Are we seeing clusters emerge according to mammal taxonomy or diet?

Finally we will consider the use of a phylogenetic beta-diversity measure, normalized weighted UniFrac (Lozupone et al., 2007), which is a phylogenetic extension of the Bray-Curtis measure. Since both relative abundance and degree of relatedness will now be considered, we expect our black and spectacled friends to have dissimilarity values less than 1.0. EBD needs two input files to perform a phylogenetic beta-diversity analysis: a tree relating the various sequences, and an input file showing how many times each leaf of the tree appears in each sample.
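I have created both of these input files for you: the tree is "mammals.unique.tree", and the frequency file is "mammal.leaf.counts". The calculator is still specified as Bray-Curtis; supplying the tree with -t is what makes the calculation phylogenetic. The command for EBD is:

    ExpressBetaDiversity -s mammal.leaf.counts -t mammals.unique.tree -d NWU.dst -c Bray-Curtis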
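To see why adding a tree can pull dissimilarities below 1.0, consider the following toy Python sketch. It assumes a hand-coded four-leaf tree, ((A:0.1,B:0.1):0.9,(C:0.1,D:0.1):0.9), and computes a branch-weighted Bray-Curtis dissimilarity in which each branch contributes the relative abundance of the leaves below it, weighted by branch length. This is only an illustration of the idea, not EBD's implementation; the tree, the counts and the function names are hypothetical.

```python
def bray_curtis(x, y):
    """Non-phylogenetic Bray-Curtis dissimilarity between two dicts of OTU counts."""
    keys = set(x) | set(y)
    num = sum(abs(x.get(k, 0) - y.get(k, 0)) for k in keys)
    den = sum(x.get(k, 0) + y.get(k, 0) for k in keys)
    return num / den


def branch_weighted_bray_curtis(x, y, branches):
    """Bray-Curtis computed over tree branches instead of discrete OTUs.

    branches is a list of (branch_length, set_of_leaves_below_branch).
    Counts are converted to relative abundances within each sample.
    """
    total_x, total_y = sum(x.values()), sum(y.values())
    num = den = 0.0
    for length, leaves in branches:
        px = sum(x.get(leaf, 0) for leaf in leaves) / total_x
        py = sum(y.get(leaf, 0) for leaf in leaves) / total_y
        num += length * abs(px - py)
        den += length * (px + py)
    return num / den


# Toy tree ((A:0.1,B:0.1):0.9,(C:0.1,D:0.1):0.9), written out branch by branch.
branches = [
    (0.1, {"A"}), (0.1, {"B"}), (0.9, {"A", "B"}),
    (0.1, {"C"}), (0.1, {"D"}), (0.9, {"C", "D"}),
]

bear_1 = {"A": 10}   # hypothetical sample containing only leaf A
bear_2 = {"B": 10}   # hypothetical sample containing only leaf B, A's sister

print("Non-phylogenetic:", bray_curtis(bear_1, bear_2))
print("Branch-weighted: ", round(branch_weighted_bray_curtis(bear_1, bear_2, branches), 3))
```

The two samples share no leaves, so the non-phylogenetic value is 1.0, but most of the path from the root to A and to B is shared, so the branch-weighted value is far smaller.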

The second EBD command creates a new PHYLIP-formatted distance matrix file, "NWU.dst". With this in hand, you can carry out a similar analysis in mothur and R as you did above.

- Are the clustering patterns similar between the phylogenetic and non-phylogenetic approaches? Are there any striking differences in the placement of particular animals?

Of course, principal coordinate analysis isn't the only thing you can do with a distance matrix; it is the launching point for many different kinds of adventure. Another fun thing to do is cluster the matrix using the UPGMA algorithm, again using mothur:

    mothur > tree.shared(phylip=NWU.dst)

which will generate a rooted tree that shows the inferred hierarchical similarity patterns. It's in Newick format, so any old tree viewer should work for you.

- As before, do the clusters we get from this analysis make any taxonomic or physiological sense?

Part IV: Functional analysis

In the last part of the tutorial, we will take a brief look at the functional genes that were identified through shotgun sequencing of the same mammal samples. As mentioned above, the original source for the data is MG-RAST, which makes available both the DNA sequences and a series of analyses that were carried out on these. The MG-RAST system offers a wide range of precooked taxonomic and functional assignments, but these are very often based on simple homology matching, and both taxonomy and function are likely to contain many incorrect calls. If you're looking to perform broad comparisons of many datasets this may be OK, but if you're really concerned about specific functions then you should carry out your own analysis with specialized tools. Since we are time-limited here, however, we will focus on a couple of straightforward analyses based on the MG-RAST annotations. The basic questions we will ask are:

- Are there specific classes of function that distinguish different sets of mammals? Do these make sense in light of known physiological differences?
- If we cluster mammal samples based on their functional profiles, are these clusters similar to those obtained with 16S or do different patterns emerge?

To build the datasets, I first acquired the files from ftp://ftp.metagenomics.anl.gov/projects/116/*/processed/, where * is the identifier for each of the metagenome samples. The files that start with '900' and '999' are of particular interest as they contain taxonomic and functional predictions. We will focus on the contents of the files '900.abundance.function' and '999.done.Subsystems.stats', as these contain both detailed and summary information about the predicted functions of the metagenomes. The .function file is cross-referenced with several reference databases, including the SEED subsystems that are described in Overbeek et al. (2005). The number of functional assignments is large (about 5000) and there is some redundancy, although KEGG and other annotations are worse in this respect. By contrast, the .Subsystems.stats file is a much higher-level summary grouped into 28 functional categories.

I wrote a short Perl script to collect these data for each of the mammal samples. The file that contained all ~5000 SEED functional assignments was large enough to break STAMP, so I instead subselected a couple of potentially interesting or uninteresting functional sets into separate files.

- Before we start looking at results, do you have any ideas about what types of function might be different between carnivores, herbivores and omnivores?

Here are the files we have to work with:

- STAMP_Subsystems.txt: this is the high-level subsystem summary by sample, which covers all predictions at a high level.
- STAMP_Carbohydrate.txt: all detailed functions with associated EC numbers that fall under the 'Carbohydrate' category.
- STAMP_MembraneTransport.txt: as above, but for the 'Membrane Transport' category.
- STAMP_ProteinMetabolism.txt: as above, but for the 'Protein Metabolism' category.
- MammalMetagenomeMetadata.txt: the metadata file for STAMP that associates each mammal with its dietary physiology.

The objective here is to load different functional summaries into STAMP to see what sort of clusters emerge, whether any functions show statistically significant differences, and whether these are at all interpretable in terms of possible host physiological differences. Focus on PCA and post-hoc plots as before.

- Which sets give better or worse clustering of the mammal set?
- What functions (at high and precise levels) show statistically significant differences?

That's it! This is necessarily a basic overview of what sorts of things can be done with taxonomy and function, but hopefully serves as a useful basis for interpreting (and maybe trying!) some of the more advanced techniques you'll see applied in good metagenomic papers.

References:

Caporaso JG, Kuczynski J, Stombaugh J, Bittinger K, Bushman FD, Costello EK, Fierer N, Peña AG, Goodrich JK, Gordon JI, Huttley GA, Kelley ST, Knights D, Koenig JE, Ley RE, Lozupone CA, McDonald D, Muegge BD, Pirrung M, Reeder J, Sevinsky JR, Turnbaugh PJ, Walters WA, Widmann J, Yatsunenko T, Zaneveld J, Knight R (2010) QIIME allows analysis of high-throughput community sequencing data. Nat Methods 7:

Hughes JB, Hellmann JJ, Ricketts TH, Bohannan BJ. Counting the uncountable: statistical approaches to estimating microbial diversity. Appl Environ Microbiol 67:

Lozupone CA, Hamady M, Kelley ST, Knight R (2007) Quantitative and qualitative beta diversity measures lead to different insights into factors that structure microbial communities. Appl Environ Microbiol 73:

Meyer F, Paarmann D, D'Souza M, Olson R, Glass EM, Kubal M, Paczian T, Rodriguez A, Stevens R, Wilke A, Wilkening J, Edwards RA (2008) The metagenomics RAST server - a public resource for the automatic phylogenetic and functional analysis of metagenomes. BMC Bioinformatics 9:386.

Overbeek R, Begley T, Butler RM, Choudhuri JV, Chuang HY, Cohoon M, de Crécy-Lagard V, Diaz N, Disz T, Edwards R, Fonstein M, Frank ED, Gerdes S, Glass EM, Goesmann A, Hanson A, Iwata-Reuyl D, Jensen R, Jamshidi N, Krause L, Kubal M, Larsen N, Linke B, McHardy AC, Meyer F, Neuweger H, Olsen G, Olson R, Osterman A, Portnoy V, Pusch GD, Rodionov DA, Rückert C, Steiner J, Stevens R, Thiele I, Vassieva O, Ye Y, Zagnitko O, Vonstein V (2005) The subsystems approach to genome annotation and its use in the project to annotate 1000 genomes. Nucleic Acids Res 33:

Parks DH, Beiko RG (2010) Identifying biologically relevant differences between metagenomic communities. Bioinformatics 26:

Parks DH, Beiko RG (2013) Measures of phylogenetic differentiation provide robust and complementary insights into microbial communities. ISME J 7:

Price MN, Dehal PS, Arkin AP (2010) FastTree 2--approximately maximum-likelihood trees for large alignments. PLoS One 5:e9490.

Schloss PD, Westcott SL, Ryabin T, Hall JR, Hartmann M, Hollister EB, Lesniewski RA, Oakley BB, Parks DH, Robinson CJ, Sahl JW, Stres B, Thallinger GG, Van Horn DJ, Weber CF (2009) Introducing mothur: open-source, platform-independent, community-supported software for describing and comparing microbial communities. Appl Environ Microbiol 75: