16S rRNA Sequencing Data Analysis Tutorial with QIIME

Jasmine Barrett
QIIME Analysis

Contents

16S rRNA SEQUENCING DATA ANALYSIS TUTORIAL WITH QIIME
Report Overview
How to Obtain Microbiome Data
How to Setup QIIME
Essential files for QIIME
Sequence File (.fna)
Quality File (.qual)
Mapping File
Basic Statistics on Sequence Data
OTU Picking
Basic Statistics on OTU Table
OTU Heatmap
Data Analysis
Summarize Communities by Taxonomic Composition
Investigating Alpha Diversity
Identifying Differentially Abundant OTUs
Normalizing OTU Table
Beta-diversity and PCoA
Jackknifed Beta Diversity Analysis
Make Bootstrapped Tree
Comparing Categories
Conclusion
REFERENCES

Tables and Figures

Figure 1. FASTQ File Format
Figure 2. Mothur output for sequence summary
Figure 3. Summary for biom file
Figure 4. rep_set_tax_assignments.txt
Figure 5. Heatmap for HMP data
Figure 6. Pie plot of the degree of sharing of microbial taxa in 14 collected samples from 7 different points with a four-month interval in a hospital room
Figure 7. Area plot of the degree of sharing of microbial taxa in 14 collected samples from 7 different points with a four-month interval in a hospital room
Figure 8. Bar plot of the degree of sharing of microbial taxa in 14 collected samples from 7 different points with a four-month interval in a hospital room
Figure 9. Microbial composition of the microbial taxa in 14 collected samples
Figure 10. Rarefaction plot for date_s
Figure 11. Rarefaction plot for sample_type_s
Figure 12. Diff_otus.txt for Computer Mouse and Countertop
Figure 13. MA plot for differential abundance of Computer Mouse and Countertop
Figure 14. Dispersion Estimate Plot for differential abundance of Computer Mouse and Countertop
Figure 15. MA plot for Computer Mouse Samples
Figure 16. Dispersion Estimate Plot for Computer Mouse Samples
Figure 17. PCoA plot for the bacterial community collected in the hospital room. Communities were characterized by samples collected in February and April. Bray-Curtis is used as the distance metric
Figure 18. PCoA plot for the bacterial community collected in the hospital room
Figure 19. 3D PCoA Plots for HMP samples
Figure 20. Distance Boxplot for Surface type
Figure 21. Distance Comparison among surface types
Figure 22. Jackknifed UPGMA clustering (using the weighted UniFrac metric) showing the similarity of bacterial communities based on 16S rRNA genes

16S rRNA SEQUENCING DATA ANALYSIS TUTORIAL WITH QIIME

Report Overview

Rapid progress in DNA sequencing techniques has changed metagenomics research and its data analysis methods over the past few years. Sequencing of the 16S rRNA gene has become a relatively easy way to study microbial composition and diversity (Fierer et al., 2007). High-throughput bioinformatics analyses increasingly rely on pipeline frameworks to process sequence data and metadata. Popular bioinformatics pipelines in the literature are QIIME, mothur and UPARSE. In this study, QIIME (Quantitative Insights Into Microbial Ecology) (Caporaso et al., 2010), an open-source bioinformatics pipeline, is used to perform microbiome analysis from raw DNA sequencing data. QIIME is designed to create quality graphics and statistics from raw sequencing data generated on Illumina or other platforms. A typical QIIME analysis workflow consists of demultiplexing, quality filtering, clustering (OTU detection), chimera removal, taxonomic assignment, phylogenetic reconstruction, and diversity analyses and visualizations.

This document is organized as an introductory tutorial on how to analyze 16S sequencing data using current methods. It covers the following basic questions about microbiome data:

1. Proportionally, what microbes are found in each sample community?
2. How many species are in each sample?
3. Are there species significantly more abundant in one set of samples than in another?
4. How much does diversity change between samples?
5. Do different sample groupings significantly differ in their microbial composition?

This document is structured around these questions, so that each section is primarily concerned with how to answer one particular question about the microbiome data.

How to Obtain Microbiome Data

The Sequence Read Archive (SRA) is a bioinformatics database that provides a public repository for DNA sequencing data obtained from next-generation sequencing (NGS) technology. Raw sequence data and metadata can be searched and downloaded for downstream analysis. Biotechnology companies such as 454, Ion Torrent, Illumina, SOLiD, Helicos and Complete Genomics provide a line of products and services for sequencing, genotyping and gene expression. Illumina is one of the successful companies whose technology has reduced the cost of sequencing a human genome to a reasonable price. Since Illumina will eventually be used for sequencing in this project, the SRA database was searched for 16S rRNA data obtained with Illumina systems, and the Hospital Microbiome Project data was obtained from the database. Every experiment in the SRA database has an accession code and metadata such as the study abstract, experiment attributes and the owner of the data. The raw sequence data for an experiment can be downloaded in FASTA and FASTQ format using its accession code.

The Hospital Microbiome Project (HMP) (Shogan et al., 2013) aims to collect microbial samples from surfaces, air, staff, and patients at the University of Chicago's new hospital pavilion, involving 10 patient rooms, 2 nursing stations, staff, water and air sampling, both daily and weekly over a year, in order to better understand the factors that influence bacterial population development in health care environments.

As a preliminary exploration, a small data set from the HMP was analyzed. The data were collected from seven different points (countertop, computer mouse, station phone, chair armrest, corridor floor, hot tap water faucet and cold tap water faucet) in the same room (S10) at two different time points (27/02/2013 and 17/04/2013).

How to Setup QIIME

QIIME is a software package of Python wrapper scripts that can be downloaded and used on a Linux system. It can also be run in VirtualBox on a Windows operating system. I used QIIME version [number missing] in VirtualBox on Windows.

Essential files for QIIME

QIIME works with the FASTQ file format. A FASTQ file uses four lines per sequence:

Line 1 begins with a '@' character and is followed by a sequence identifier and an optional description.
Line 2 is the raw sequence letters.
Line 3 begins with a '+' character and is optionally followed by the same sequence identifier as in line 1.
Line 4 encodes the quality values for the sequence in line 2, and must contain the same number of symbols as there are letters in the sequence.

The FASTQ format holds sequence data as well as its quality data. QIIME provides the convert_fastaqual_fastq.py script to convert a FASTQ file into a qual file with the quality scores and an fna file with the sequence data.

convert_fastaqual_fastq.py -f seqs.fastq -c fastq_to_fastaqual

Figure 1. FASTQ File Format

Sequence File (.fna)

The sequence file holds the raw sequence data for each sequence. A typical record in fna format looks as follows:

Line 1 begins with a '>' character and is followed by an accession run code.
Line 2 is the raw sequence letters.

Quality File (.qual)

The quality file holds the quality scores for each sequence. A typical record in qual format looks as follows:

Line 1 begins with a '>' character and is followed by an accession run code.
Line 2 is the quality scores.
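
To make the relationship between the three file formats concrete, the following is a minimal Python sketch of how a four-line FASTQ record maps onto an fna record and a qual record. It is an illustration only, not the implementation of convert_fastaqual_fastq.py. The function names are hypothetical, and the sketch assumes Phred+33 quality encoding, which may not match every data set.

```python
from typing import Iterator, List, Tuple


def parse_fastq(lines: List[str]) -> Iterator[Tuple[str, str, List[int]]]:
    """Yield (identifier, sequence, quality scores) for each four-line record."""
    for i in range(0, len(lines) - 3, 4):
        header, seq, plus, qual = (line.strip() for line in lines[i:i + 4])
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed FASTQ record starting at line {i + 1}")
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch at line {i + 1}")
        # Assumption: qualities are Phred+33 encoded (ASCII code minus 33).
        scores = [ord(ch) - 33 for ch in qual]
        # Keep only the identifier, dropping the optional description.
        yield header[1:].split()[0], seq, scores


def to_fna_and_qual(lines: List[str]) -> Tuple[str, str]:
    """Return the text of an fna file and a qual file for the given FASTQ lines."""
    fna, qual = [], []
    for ident, seq, scores in parse_fastq(lines):
        fna.append(f">{ident}\n{seq}")
        qual.append(f">{ident}\n{' '.join(map(str, scores))}")
    return "\n".join(fna) + "\n", "\n".join(qual) + "\n"
```

Both outputs use the two-line layout described above: a '>' header line followed by either the sequence letters or the space-separated quality scores.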

Mapping File

QIIME requires a metadata mapping file for most analyses. The mapping file is generated by the user and contains all of the information, categorical or numeric, about the samples that is necessary to perform the data analysis. It can be created in Excel or a text editor and must be tab-delimited. The mapping file is important because it links each sample identifier with its metadata.

In a typical mapping file, each line refers to one sample. A line starts with the SampleID, followed by the BarcodeSequence used for the sample and the LinkerPrimerSequence used to amplify the sample, and ends with a Description column. The first column must be SampleID; sample IDs may contain alphanumeric characters and periods but not underscores. The SampleID should match the sequence headers used in the FASTA files. Between these columns, the file can hold any metadata that relates to the samples, as well as additional information on specific samples that may be useful to have at hand when considering outliers. The last column must be Description.

In some circumstances, users may need to generate a mapping file that does not contain barcodes and/or primers. In that case the BarcodeSequence and LinkerPrimerSequence fields can be left empty.

QIIME provides validate_mapping_file.py to check whether a mapping file is in the right format. The script tests for many problems in the mapping file and writes a _corrected.txt version of the mapping file to the output folder. If the BarcodeSequence and LinkerPrimerSequence fields are empty, barcode and primer testing need to be disabled with the -p and -b parameters:

validate_mapping_file.py -m <mapping_filepath> -o <outputpath> -p -b
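
The formatting rules above can be sketched as a small Python check. This is a simplified illustration and covers far fewer cases than validate_mapping_file.py. The function name check_mapping is hypothetical, and accepting a leading '#' on the SampleID header is an assumption about the file layout.

```python
import re
from typing import List


def check_mapping(text: str) -> List[str]:
    """Return a list of problems found in a tab-delimited mapping file."""
    rows = [line.split("\t") for line in text.splitlines() if line.strip()]
    if not rows:
        return ["mapping file is empty"]
    problems = []
    header = rows[0]
    # Assumption: the header may be written as "#SampleID" or "SampleID".
    if header[0].lstrip("#") != "SampleID":
        problems.append("first column must be SampleID")
    if header[-1] != "Description":
        problems.append("last column must be Description")
    seen = set()
    for n, row in enumerate(rows[1:], start=2):
        if len(row) != len(header):
            problems.append(f"row {n}: expected {len(header)} fields, got {len(row)}")
        sample_id = row[0]
        # Only alphanumeric characters and periods are allowed, no underscores.
        if not re.fullmatch(r"[A-Za-z0-9.]+", sample_id):
            problems.append(f"row {n}: invalid SampleID {sample_id!r}")
        if sample_id in seen:
            problems.append(f"row {n}: duplicate SampleID {sample_id!r}")
        seen.add(sample_id)
    return problems
```

Empty BarcodeSequence and LinkerPrimerSequence fields pass this check, which mirrors the case where barcode and primer testing is disabled.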

Basic Statistics on Sequence Data

The count_seqs.py -i <sequence_file.fna> script counts the sequences and calculates the mean and standard deviation of the sequence length. Our file had [count missing] sequences in total, with a mean sequence length of 151 and a standard deviation of 0. Mothur gives more detailed statistics such as the minimum, maximum, median and quartiles. Running the summary.seqs(fasta=<sequence_file.fna>) command displays the following screen and creates a summary output file.

Figure 2. Mothur output for sequence summary

OTU Picking

Picking OTUs is also called "clustering", because sequences that share at least some threshold of identity are clustered together into an OTU. There are three different methods for OTU picking:

De novo clustering
Closed-reference
Open-reference
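
The idea of clustering sequences by an identity threshold can be illustrated with a toy greedy procedure in Python. This sketch is not the algorithm used by UCLUST or usearch. It assumes equal-length reads that can be compared position by position without alignment, and the names identity and greedy_cluster are hypothetical. The 0.97 default corresponds to the 97% identity level used for the Greengenes reference OTUs.

```python
from typing import Dict, List


def identity(a: str, b: str) -> float:
    """Fraction of matching positions; assumes equal-length, pre-aligned reads."""
    if len(a) != len(b) or not a:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_cluster(seqs: Dict[str, str], threshold: float = 0.97) -> Dict[str, List[str]]:
    """Assign each read to the first centroid it matches, else start a new OTU.

    seqs maps read identifiers to sequences; the result maps each centroid
    identifier to the identifiers of the reads in its cluster.
    """
    clusters: Dict[str, List[str]] = {}
    for read_id, seq in seqs.items():
        for centroid in clusters:
            if identity(seq, seqs[centroid]) >= threshold:
                clusters[centroid].append(read_id)
                break
        else:
            # No existing centroid is similar enough: this read seeds a new OTU.
            clusters[read_id] = [read_id]
    return clusters
```

In this picture, closed-reference picking would fix the centroids in advance from a reference database, de novo picking would discover all centroids from the data as above, and open-reference picking would do the first and then the second on the reads that were left over.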

The choice of method depends on what is known a priori about the microbial community. If the community is well studied, 16S databases contain many representatives and the closed-reference OTU picking strategy is suitable. The de novo method is suitable for discovering new species. The open-reference method combines the closed-reference and de novo methods and is the method most strongly recommended by the QIIME developers. It first clusters sequences against a database of 16S reference sequences called Greengenes, then applies de novo clustering to those sequences that are not similar to the reference sequences.

Table 1. Which OTU picking strategies in which study?

| OTU Picking Strategy | Script | In Which Study? |
|---|---|---|
| Closed reference | pick_closed_reference_otus.py | Human, mouse, gut, skin, oral microbiome |
| De novo | pick_de_novo_otus.py | Environmental, soil, water etc. hazy microbiome |
| Open reference | pick_open_reference_otus.py | Any microbiome study. QIIME developers suggest this method. |

The following table compares the advantages and disadvantages of the OTU picking strategies.

Table 2. Advantages and Disadvantages of OTU Picking Strategies

| OTU Picking Strategy | Advantages | Disadvantages |
|---|---|---|
| Closed reference | Fast and parallelizable. Suitable for big datasets. Because it uses reference databases, it produces good-quality taxonomies and trees. | Not possible to find new species. |
| De novo | Clusters all sequences. | Not parallelizable, so slow for big datasets. |
| Open reference | Clusters all sequences. Part of the work is parallelized. Faster than de novo. | The non-parallelizable part of the work is slow. It might take a very long time when many new species outside the reference databases are found. |

The open-reference OTU picking strategy was used for our HMP data analysis, through QIIME's pick_open_reference_otus.py script. This script runs many substeps in a single step: it (1) picks OTUs, (2) generates a representative sequence for each OTU, (3) assigns known taxonomy to those OTUs, (4) creates a phylogenetic tree, and (5) creates an OTU table.

pick_open_reference_otus.py -i <sequence_file.fna> -r <97_otus.fasta> -o <outputpath> -s 0.1 -m <clustering algorithm> -p <parameter_file>

97_otus.fasta is the reference OTU file from Greengenes, the database of reference 16S sequences that is used to assign taxonomy. The 97_otus.fasta file is created by clustering all the sequences in the Greengenes database into 97% identity clusters. A representative sequence is chosen from each of those clusters and used to create the 97_tree and 97_taxonomy. Sequences in our data are compared with the representative sequences in 97_otus.fasta, and the taxonomy of the most similar sequence is assigned to our sequence.

The default clustering algorithm of the pick_open_reference_otus.py script is UCLUST. Because usearch is widely used for OTU picking, usearch was used as the clustering algorithm for our data. The parameter file was created by the user and contains the line pick_otus:enable_rev_strand_match True. This line is needed if most or all of the sequences fail to hit the reference during the prefiltering or closed-reference OTU picking steps, in which case the sequences may be in the reverse orientation
with respect to the reference database. This line addresses the problem, but it doubles the amount of memory used in the workflow.

The script creates an index.html file, a navigation page with an informative table about the output files. The important outputs of the script are the following four files:

rep_set.tre: the phylogenetic tree describing the relationship of all of our sequences.
rep_set.fna: the list of representative sequences, one for each OTU.
otu_table_mc2_w_tax.biom: the final OTU results, including taxonomic assignments and per-sample abundances, stored in a biom file. mc2 refers to a minimum size of 2, meaning that each OTU requires at least 2 sequences. This is the file mostly used for deeper analysis.
final_otu_map_mc2.txt: the listing of which reads were clustered into which OTU.

Basic Statistics on OTU Table

The biom summarize-table -i <biom_file> -o <outputpath> command creates a summary of the OTU table. Figure 3 shows the summary for the biom file, including the number of OTUs picked. Counting the sequences in the representative sequence file rep_set.fna should give the same number.

The assign_taxonomy.py -i <rep_set.fna> -o <taxonomyresults_outputpath> script assigns taxonomy to each OTU representative sequence. It creates the rep_set_tax_assignments.txt file, which contains an entry for each representative sequence, listing taxonomy to the greatest depth allowed by the confidence threshold (80% by default, can be
changed with the -c option), and a column of confidence values for the deepest level of taxonomy shown.

Figure 3. Summary for biom file

Figure 4. rep_set_tax_assignments.txt

OTU Heatmap

The make_otu_heatmap.py -i <biom file> -o <heatmap.pdf> script creates a PDF file with a visualization of the OTU table. Each row corresponds to an OTU and each column corresponds to a sample. The higher the relative abundance of an OTU in a sample, the more intense the color at the corresponding position in the heatmap.

Figure 5. Heatmap for HMP data

Data Analysis

Summarize Communities by Taxonomic Composition

The relative abundances of taxa per sample in the OTU table show which microbes are found in each sample community.

Question: Proportionally, what microbes are found in each sample community?
Scripts: summarize_taxa.py and plot_taxa_summary.py
Output: Plots showing relative abundance data per sample

The summarize_taxa.py -i <biom file> -o <taxasummary_outputpath> script generates text files with relative abundance data per sample, giving a basic overview of the members of the community for all taxonomic ranks. The taxonomic rank to summarize can be
specified with the -L parameter of the script (1 for kingdom, 2 for phylum, 3 for class, 4 for order, 5 for family, 6 for genus, 7 for species). The output text files can be passed to the plot_taxa_summary.py script to create plots with the following command:

plot_taxa_summary.py -i <taxasummary_outputpath/otu_table_w_tax.txt> -l <taxonomic rank> -c pie,bar,area -o <taxscharts_outputpath>

The following pie plot shows the total relative abundance for all data.

Figure 6. Pie plot of the degree of sharing of microbial taxa in 14 collected samples from 7 different points with a four-month interval in a hospital room.

Area and bar plots (Figures 7 and 8) show the relative abundance of taxa for each sample.
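
The summarization by taxonomic rank can be sketched in Python as follows. This is a simplified illustration rather than the summarize_taxa.py implementation. It assumes that the OTU table is available as a nested dictionary of counts and that each taxonomy string lists ranks separated by semicolons; the function name and data layout are hypothetical.

```python
from collections import defaultdict
from typing import Dict


def summarize_taxa(counts: Dict[str, Dict[str, float]],
                   taxonomy: Dict[str, str],
                   level: int) -> Dict[str, Dict[str, float]]:
    """Collapse OTU counts to one taxonomic level and convert to relative abundance.

    counts maps OTU id -> {sample id: count}; taxonomy maps OTU id to a string
    such as "k__Bacteria; p__Firmicutes; c__Bacilli"; level 2 means phylum.
    """
    sample_totals: Dict[str, float] = defaultdict(float)
    taxon_counts: Dict[str, Dict[str, float]] = defaultdict(lambda: defaultdict(float))
    for otu_id, per_sample in counts.items():
        ranks = [rank.strip() for rank in taxonomy[otu_id].split(";")]
        label = ";".join(ranks[:level])  # keep ranks down to the requested level
        for sample, count in per_sample.items():
            taxon_counts[label][sample] += count
            sample_totals[sample] += count
    # Divide by each sample's total so that every sample sums to 1.
    return {
        label: {s: c / sample_totals[s] for s, c in per_sample.items() if sample_totals[s] > 0}
        for label, per_sample in taxon_counts.items()
    }
```

Because every sample is scaled to sum to 1, samples with different sequencing depth can be compared side by side in pie, bar and area plots.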

Figure 7. Area plot of the degree of sharing of microbial taxa in 14 collected samples from 7 different points with a four-month interval in a hospital room.

Figure 8. Bar plot of the degree of sharing of microbial taxa in 14 collected samples from 7 different points with a four-month interval in a hospital room.

Figure 9 shows the microbial composition of each sample at the two time points at phylum level. From the plots, the taxa appear to change more on the computer mouse, countertop and tap faucet handles between the two time points. On the other hand, those samples show similar taxa proportions within the same time point. This might be because the person who used
those locations was the same at the first time point, and by the second time point the person using those locations had changed, which modified the microbial abundance of taxa in the samples at the second time point.

Figure 9. Microbial composition of the microbial taxa in 14 collected samples (corridor floor, computer mouse, countertop, station phone, chair armrest, cold tap water faucet handle and hot tap water faucet handle, each sampled in February and April)

Investigating Alpha Diversity

Alpha diversity describes the diversity of species in a single sample or environment.

Question: How many species are in each sample?
Script: alpha_rarefaction.py -i <biom file> -o <alphadiversity_outputpath> -p <parameters.txt> -m <mapping file>
Output: Rarefaction plots.

This script performs several steps: (1) generate rarefied OTU tables; (2) compute alpha diversity metrics for each rarefied OTU table; (3) collate alpha diversity results; and (4) generate alpha rarefaction plots. Alpha diversity increases with sequencing depth, and rarefaction plots are useful for comparing alpha diversity between two or more samples that may have unequal sequencing depth. The plot shows the alpha diversity value versus the number of included
sequences. To build rarefaction curves, each community is randomly subsampled without replacement at different intervals, and the average number of OTUs at each interval is plotted against the size of the subsample.

The parameter file is a text file that lists the alpha diversity metrics. observed_species, shannon and chao1 are commonly used alpha diversity metrics. observed_species is the number of OTUs identified per sample. Shannon diversity is a measure of entropy, and chao1 is a measure that predicts OTU richness at high sequencing depth. The following command creates the parameters.txt file:

echo 'alpha_diversity:metrics observed_species,shannon,chao1' > parameters.txt

After running the script on our data, an HTML page with rarefaction plots was created.

Figure 10. Rarefaction plot for date_s
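
The three metrics and the subsampling step can be written out in a few lines of Python. This is a minimal sketch under stated assumptions, not the code that alpha_rarefaction.py runs: Shannon entropy is computed with a base-2 logarithm, chao1 uses the bias-corrected form S_obs + F1(F1 - 1) / (2(F2 + 1)), where F1 and F2 are the numbers of singleton and doubleton OTUs, and all function names are hypothetical.

```python
import math
import random
from collections import Counter
from typing import List, Sequence, Tuple


def observed_species(counts: Sequence[int]) -> int:
    """Number of OTUs with at least one sequence in the sample."""
    return sum(1 for c in counts if c > 0)


def shannon(counts: Sequence[int]) -> float:
    """Shannon entropy of the OTU proportions (assumption: log base 2)."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def chao1(counts: Sequence[int]) -> float:
    """Bias-corrected Chao1 richness estimate from singletons and doubletons."""
    f1 = sum(1 for c in counts if c == 1)
    f2 = sum(1 for c in counts if c == 2)
    return observed_species(counts) + f1 * (f1 - 1) / (2 * (f2 + 1))


def rarefy(counts: Sequence[int], depth: int, rng: random.Random) -> List[int]:
    """Subsample `depth` sequences without replacement from one sample."""
    pool = [otu for otu, c in enumerate(counts) for _ in range(c)]
    picked = Counter(rng.sample(pool, depth))
    return [picked.get(otu, 0) for otu in range(len(counts))]


def rarefaction_curve(counts: Sequence[int], depths: Sequence[int],
                      reps: int = 10, seed: int = 0) -> List[Tuple[int, float]]:
    """Average observed OTUs over `reps` random subsamples at each depth."""
    rng = random.Random(seed)
    return [
        (d, sum(observed_species(rarefy(counts, d, rng)) for _ in range(reps)) / reps)
        for d in depths
    ]
```

Plotting the second element of each pair against the first gives one rarefaction curve per sample; a curve that flattens suggests that deeper sequencing would add few new OTUs.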

Figure 11. Rarefaction plot for sample_type_s

Identifying Differentially Abundant OTUs

Question: Are there species significantly more abundant in one set of samples than in another? Which microbes are significantly different between two sample groupings? Do specific groups of samples differ in their microbial composition?
Script: differential_abundance.py -i <biom file> -o <output.txt> -m <mapping file> -a DESeq2_nbinom -c <mapping category> -x <subcategory 1> -y <subcategory 2> -d
Output: Text file with a list of differentially abundant OTUs and their statistics, and an MA plot.

OTU differential abundance testing is used to identify OTUs that differ between two mapping file sample categories, denoted by -x and -y in the script. The method for identifying differentially abundant OTUs is given by -a. DESeq2_nbinom and metagenomeSeq_fitZIG are the differential abundance algorithms that can be used in QIIME (Paulson, Stine, Bravo, & Pop, 2013). The -d option creates an MA plot. The MA plot shows the relationship between intensity and difference between two data stores. The x-axis represents the average quantitated
value across the data stores, and the y-axis shows the difference between them. The script also creates a dispersion estimate plot that visualizes the fitted dispersion vs. mean relationship.

To see whether any OTUs are significantly more abundant in the countertop samples than in the computer mouse samples, countertop was passed as the -y option and computer mouse as the -x option. According to the output text file, members of Actinobacteria are significantly more abundant in the countertop samples.

Figure 12. Diff_otus.txt for Computer Mouse and Countertop
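
The two axes of an MA plot can be computed with a short Python sketch. It only illustrates the coordinates and performs no statistical test, so it is not a substitute for DESeq2. The pseudocount added before taking the logarithm, the use of raw group means, and the function name ma_values are assumptions made for this illustration.

```python
import math
from typing import Dict, List, Tuple


def ma_values(group_x: Dict[str, List[float]],
              group_y: Dict[str, List[float]],
              pseudocount: float = 1.0) -> Dict[str, Tuple[float, float]]:
    """Return {OTU id: (A, M)} for two groups of samples.

    A is the average of the two group means (x-axis) and M is the log2 ratio
    of the group means, y over x (y-axis). The pseudocount avoids log(0).
    """
    result = {}
    for otu_id in group_x:
        mean_x = sum(group_x[otu_id]) / len(group_x[otu_id])
        mean_y = sum(group_y[otu_id]) / len(group_y[otu_id])
        a = (mean_x + mean_y) / 2
        m = math.log2((mean_y + pseudocount) / (mean_x + pseudocount))
        result[otu_id] = (a, m)
    return result
```

With this convention, an OTU with a positive M value is more abundant in the group passed as y, and OTUs far from M = 0 are candidates for differential abundance.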

Figure 13. MA plot for differential abundance of Computer Mouse and Countertop

Figure 14. Dispersion Estimate Plot for differential abundance of Computer Mouse and Countertop

In the pie charts, the computer mouse samples taken in February and April show visibly different taxonomic composition. As an experiment, the differential abundance script was run on those samples; Figures 15 and 16 show the resulting MA plot and dispersion estimate plot.

Figure 15. MA plot for Computer Mouse Samples.

Figure 16. Dispersion Estimate Plot for Computer Mouse Samples

Normalizing OTU Table

When analyzing microbial data, uneven sequencing depth can lead to biased results. Having a different number of sequences for each sample causes inaccurate results in beta diversity analyses.

Question: How to prevent bias as a result of uneven sequencing depth?
Script: normalize_table.py -i <biom file> -a CSS -o <normalized biom file>
Output: Biom table with normalized counts. This table is used as the input biom file for the beta diversity script.

The -a option determines the normalization algorithm to apply to the input biom table. The default algorithm is CSS. CSS stands for cumulative sum scaling, a normalization that is an adaptive extension of the quantile normalization approach and is better suited for marker gene survey data, whereby
raw counts are divided by the cumulative sum of counts up to a percentile determined using a data-driven approach (Paulson, Stine, Corrada Bravo, & Pop, 2013). DESeq2 is another normalization algorithm option. DESeq2 outputs negative values for lower-abundance OTUs as a result of its log transformation and throws away low-depth samples (e.g. fewer than 1000 sequences/sample). This presents a problem when using the Bray-Curtis and UniFrac metrics, which are common metrics for calculating ecological distance. There is no good solution yet, but CSS is the currently recommended normalization algorithm.

Beta-diversity and PCoA

In microbiome research it is important to analyze how different every sample is from all of the rest. It is equally important to know whether any grouping of samples is more similar in composition than the average. Beta diversity is a measure of diversity that describes how different the species composition of different samples is.

Question: How much does diversity change between samples?
Script: beta_diversity.py, principal_coordinates.py, make_2d_plots.py
Output: Distance matrix and Principal Coordinate plots

Mathematical and phylogenetic metrics can be used to measure the difference between two samples. Two commonly used metrics in microbiome studies are bray_curtis and unweighted_unifrac.

beta_diversity.py -i <normalized biom file> -m <distance metric> -o <beta_div_output_path> -t <rep_set.tre>
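
As an illustration of a non-phylogenetic metric, the Bray-Curtis dissimilarity between two samples is the sum of the absolute differences of their OTU abundances divided by the sum of all abundances in both samples. The Python sketch below computes it for every pair of samples. The data layout (one abundance vector per sample, with OTUs in the same order) and the function names are assumptions for this example; the phylogenetic UniFrac metrics additionally need the tree and are not covered here.

```python
from typing import Dict, List, Sequence, Tuple


def bray_curtis(u: Sequence[float], v: Sequence[float]) -> float:
    """Bray-Curtis dissimilarity: sum |u_i - v_i| / sum (u_i + v_i)."""
    numerator = sum(abs(a - b) for a, b in zip(u, v))
    denominator = sum(a + b for a, b in zip(u, v))
    # Two empty samples are treated as identical.
    return numerator / denominator if denominator else 0.0


def distance_matrix(table: Dict[str, Sequence[float]]) -> Tuple[List[str], List[List[float]]]:
    """Return the sorted sample ids and the symmetric pairwise distance matrix."""
    samples = sorted(table)
    matrix = [[bray_curtis(table[a], table[b]) for b in samples] for a in samples]
    return samples, matrix
```

The value is 0 for identical samples and 1 for samples that share no OTUs, which is why negative normalized abundances are a problem for this metric.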

The output of the command is a distance matrix that defines the distance between every pair of samples. I used the Bray-Curtis metric to calculate distance. This matrix can be visualized in a Principal Coordinates Analysis (PCoA) plot.

principal_coordinates.py -i <beta_div_output_path>/<metric_normalized_otu_table.txt> -o <beta_div_coords.txt>
make_2d_plots.py -i <beta_div_coords.txt> -m <mapping file>

The resulting PCoA plots are shown in the following charts. Figure 17 shows the change in microbial community similarity between the two sample collection dates; the overall community appears to have largely changed between the two time points. Figure 18 shows the microbial community similarity among sample types. The computer mouse, countertop, station phone and chair armrest samples plot together, meaning that they have similar microbial communities. However, the computer mouse and countertop samples in this group were collected in February, whereas the station phone and chair armrest samples were collected in April. The pie charts also show that these samples have very similar composition. The pie charts show very different composition for the computer mouse and countertop samples at the two time points, which can also be seen in the PCoA plots. For example, the two purple circles lie far from each other on the PC1-PC2 and PC1-PC3 plots in Figure 18.

Figure 17. PCoA plot for the bacterial community collected in the hospital room. Communities were characterized by samples collected in February and April (legend: April, February). Bray-Curtis is used as the distance metric.

Figure 18. PCoA plot for the bacterial community collected in the hospital room. Communities were characterized by type of sample collected (legend: cold tap water faucet handle, hot tap water faucet handle, computer mouse, countertop, station phone, chair armrest, corridor floor). Bray-Curtis is used as the distance metric.

Jackknifed Beta Diversity Analysis

Question: How to compare samples to each other?
Script: jackknifed_beta_diversity.py -i <biom file> -t <rep_set.tre> -m <mapping file> -o <Jackknife_Output folder> -e <rarefaction_depth>
Output: 3D PCoA plots with Emperor

This script performs the following steps:

i. Compute a beta diversity distance matrix from the full data set
ii. Perform multiple rarefactions at a single depth (the -e option changes the rarefaction depth)
iii. Compute distance matrices for all the rarefied OTU tables
iv. Build UPGMA trees for all the rarefactions
v. Compare all the trees to get consensus and support values for branching
vi. Perform principal coordinates analysis on all the rarefied distance matrices
vii. Generate plots of the principal coordinates

Emperor is an interactive next-generation tool for the analysis, visualization and interpretation of high-throughput microbial ecology datasets (Vázquez-Baeza, Pirrung, Gonzalez, & Knight, 2013). After running the script, three sub-folders for each distance metric, along with 3D PCoA plots, are created. The unweighted_unifrac/emperor_pcoa_plot folder contains an HTML file with the 3D PCoA plots shown in Figure 19. Each point represents one of the samples, and distances between samples were calculated using unweighted UniFrac. Samples that lie close to each other have communities with very similar overall phylogenetic trees.

Figure 19. 3D PCoA Plots for HMP samples

The jackknife analysis created a large collection of distance matrices on which to compute statistics.

Question: How to analyze distance matrices?
Script: dissimilarity_mtx_stats.py -i <Jackknife_Output folder/unweighted_unifrac/rare_dm> -o <stat_output_folder>
Output: Three files, means.txt, medians.txt and stdevs.txt, containing the mean, median and standard deviation of the distance between each pair of samples.
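
The element-wise statistics over a collection of distance matrices can be sketched in Python as follows. This is an illustration of the idea, not the dissimilarity_mtx_stats.py implementation. It assumes that all matrices are square, list the samples in the same order, and that the population standard deviation is wanted; the function name is hypothetical.

```python
import statistics
from typing import Callable, List, Sequence, Tuple

Matrix = List[List[float]]


def matrix_stats(matrices: Sequence[Matrix]) -> Tuple[Matrix, Matrix, Matrix]:
    """Return element-wise (means, medians, standard deviations) of the matrices."""
    size = len(matrices[0])

    def apply(stat: Callable[[List[float]], float]) -> Matrix:
        # Collect entry (i, j) from every matrix and reduce it with `stat`.
        return [
            [stat([m[i][j] for m in matrices]) for j in range(size)]
            for i in range(size)
        ]

    # Assumption: population standard deviation (pstdev), not the sample form.
    return apply(statistics.mean), apply(statistics.median), apply(statistics.pstdev)
```

A large standard deviation for a pair of samples indicates that their distance depends strongly on which sequences were drawn in each rarefaction.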

Question: Are the samples in an individual category closer to each other than they are to samples outside the category?
Script: make_distance_boxplots.py -m <mapping file> -o <BoxPlot_Outout_Folder> -d stat_output_folder/means.txt -f <category> --save_raw_data
Output: Boxplot as a PDF file

In Figure 20, the first and second boxplots represent all within distances and all between distances, respectively.

Figure 20. Distance Boxplot for Surface type

Question: How to compare between samples grouped at different field states of a mapping file field?
Script: make_distance_comparison_plots.py -m <mapping file> -d <unweighted_unifrac_otu_table.txt> -f <category from mapping file> -c <comparison_groups> -o <output_folder> -a <label_type> -t <plot_type>

Output: Distance Comparison Plot

Figure 21 shows the boxplots that allow comparison among surface types. Countertop, corridor floor and station phone were taken as comparison groups and compared with the other surface types.

Figure 21. Distance Comparison among surface types

Make Bootstrapped Tree

Question: How to make a bootstrapped tree?
Script: make_bootstrapped_tree.py -m <Jackknife_Output folder/unweighted_unifrac/upgma_cmp/master_tree.tre> -s <Jackknife_Output folder/unweighted_unifrac/upgma_cmp/jackknife_support.txt> -o <Jackknife_Output folder/unweighted_unifrac/upgma_cmp/tree.pdf>

Figure 22. Jackknifed UPGMA clustering (using the weighted UniFrac metric) showing the similarity of bacterial communities based on 16S rRNA genes. Tree leaves, in order: station phone February, cold tap water faucet handle February, countertop April, cold tap water faucet handle April, computer mouse April, corridor floor April, corridor floor February, station phone April, chair armrest April, hot tap water faucet handle April, countertop February, computer mouse February, chair armrest February, hot tap water faucet handle February.

Comparing Categories

In the HMP data, seven different points in a room were sampled: countertop, computer mouse, station phone, chair armrest, corridor floor, hot tap water faucet and cold tap water faucet. Visual graphs reveal how different the microbial composition of one sample is from that of other samples, but statistical support is needed. To generate statistical support for hypotheses, the adonis and ANOSIM (analysis of similarity) statistical tests can be used. Adonis (PERMANOVA) is a nonparametric statistical method that takes a beta diversity distance matrix, a mapping file and a category in the mapping file that determines the sample grouping. It computes an R² value (effect size), which shows the percentage of variation explained by the supplied mapping file category, as well as a p-value to determine the statistical significance. ANOSIM is a method that tests whether two or more groups of samples
are significantly different. ANOSIM only works with a categorical variable that is used to do the grouping.

Question: Are the groupings of samples by a parameter in the mapping file (e.g. sample type) statistically significant?

Script 1: compare_categories.py --method adonis -i <metric_normalized_otu_table.txt> -m <mapping file> -c <comparingcategory> -o <adonis_out_folder>

Script 2: compare_categories.py --method anosim -i <metric_normalized_otu_table.txt> -m <mapping file> -c <comparingcategory> -o <anosim_out_folder>

Output: p-value and R² value (adonis) or R statistic (ANOSIM). The p-value indicates the statistical significance of the grouping of samples by the parameter. The R² value indicates the percentage of variation in distances that is explained by the grouping.

The adonis and ANOSIM tests were applied to the sample_type_s and date_s categories in the HMP data. Samples grouped by date_s and by sample_type_s did not differ significantly in microbial composition (p = 0.2, p = 0.58).

Conclusion

As a preliminary exploration, a small data set from the HMP was analyzed. Data collected from seven different points (countertop, computer mouse, station phone, chair armrest, corridor floor, hot tap water faucet and cold tap water faucet) in the same room (S10) at two different time points (27/02/2013 and 17/04/2014) were used. For each sample, the number and kinds of microbes found, the change in diversity between samples and the comparison of microbial composition among sample groupings were investigated using the QIIME pipeline. Moreover, significant
abundance change among samples was investigated. Visualization and statistical tools were used to draw conclusions.
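As a supplement to the Comparing Categories section, the permutation logic behind ANOSIM can be sketched in a few lines. This is a minimal illustration of the standard ANOSIM definition, R = (mean rank of between-group distances - mean rank of within-group distances) / (M / 2), where M is the number of sample pairs; it is not the code that compare_categories.py runs. The function names, the toy distance matrix and the group labels are assumptions made for the example.

```python
import random
from itertools import combinations


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def mean(xs):
    return sum(xs) / len(xs)


def anosim_r(dist, groups):
    """ANOSIM R statistic; assumes at least one within- and one between-group pair."""
    pairs = list(combinations(range(len(groups)), 2))
    ranks = average_ranks([dist[i][j] for i, j in pairs])
    within = [r for r, (i, j) in zip(ranks, pairs) if groups[i] == groups[j]]
    between = [r for r, (i, j) in zip(ranks, pairs) if groups[i] != groups[j]]
    return (mean(between) - mean(within)) / (len(pairs) / 2)


def anosim(dist, groups, permutations=999, seed=0):
    """Return (R, p); p is estimated by shuffling the group labels."""
    rng = random.Random(seed)
    observed = anosim_r(dist, groups)
    shuffled = list(groups)
    hits = 0
    for _ in range(permutations):
        rng.shuffle(shuffled)
        if anosim_r(dist, shuffled) >= observed:
            hits += 1
    return observed, (hits + 1) / (permutations + 1)


# Toy symmetric distance matrix for four hypothetical samples (assumed data).
dist = [
    [0.0, 0.1, 0.7, 0.8],
    [0.1, 0.0, 0.9, 0.6],
    [0.7, 0.9, 0.0, 0.2],
    [0.8, 0.6, 0.2, 0.0],
]
groups = ["February", "February", "April", "April"]
r_stat, p_value = anosim(dist, groups)
print(r_stat, p_value)
```

Here every between-group distance is larger than every within-group distance, so R equals 1. With only four samples there are very few distinct relabelings, so the p-value cannot become small however clear the separation is; this is one reason why a small data set such as the 14 HMP samples may fail to reach significance.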


Introduction to taxonomic analysis of metagenomic amplicon and shotgun data with QIIME Peter Sterk EBI Metagenomics Course 2014 1 Taxonomic analysis using next-generation sequencing Objective we want to