Introduction to taxonomic analysis of metagenomic amplicon and shotgun data with QIIME
Peter Sterk, EBI Metagenomics Course 2014

Taxonomic analysis using next-generation sequencing
- Objective: we want to obtain samples from a particular environment to find out what lives in it.
- Know your sample:
  - What kind of samples do we have? Soil, water, host-associated (e.g. gut), etc.
  - What do we expect to find in those samples? Prokaryotes, eukaryotic microorganisms (e.g. protists, fungi), viruses?
- Decide what you want to find out, e.g. bacteria/archaea populations only, or all microbes (including eukaryotic ones).
- Design your experiment around that.

Some terminology
- Amplicon: a DNA fragment that is amplified with PCR, e.g. one or more 16S rRNA variable regions, or other marker genes. Most researchers make use of standard PCR primers.
- Clustering: grouping sequences in bins (or clusters) based on a percent similarity threshold.
- Operational Taxonomic Unit (OTU): a working stand-in for species distinction in microbiology, typically defined using rDNA and a percent similarity threshold to place microbes in the same or in different OTUs. Note that an OTU is distinct from a species. For bacteria/archaea, OTUs are commonly clusters of reads that are at least 97% identical.
- Barcode: a short DNA sequence that is added to each read during amplification and that is specific for a given sample. This allows samples to be mixed (multiplexed) to reduce sequencing cost. During analysis, sequences need to be demultiplexed, i.e. separated by sample.

Common approaches: amplicon-based
- Sequencing of (regions of) target genes (amplicons) obtained by PCR using gene-specific primers.
- For bacteria/archaea, the target is usually a 16S rRNA gene fragment containing one or more variable regions; for fungi, the internal transcribed spacer (ITS); for eukaryotes, 18S rRNA gene fragments.
- Analysis usually requires a reference database that is searched to find the closest match to an OTU, from which a taxonomic lineage is inferred. Some examples:
  - Greengenes (http://greengenes.lbl.gov): 16S
  - Ribosomal Database Project (http://rdp.cme.msu.edu): 16S
  - Silva (http://www.arb-silva.de): 16S + 18S
  - Unite (http://unite.ut.ee): ITS
- Less suitable for certain groups of organisms such as protists: these are extremely diverse and only a few have sequence information. The same goes for viruses.
- We will mainly focus on 16S analysis during the hands-on as this is most common, but you must decide whether this is suitable for your work.
- We will also spend a little time on taxonomic analysis of Illumina shotgun data.

Hands-on QIIME tutorial
- QIIME (Quantitative Insights Into Microbial Ecology; qiime.org), pronounced "chime", is an open source software package for comparison and analysis of microbial communities, primarily based on high-throughput amplicon sequencing data (such as SSU rRNA) generated on a variety of platforms. It is widely used and supported.
- We will use the latest version of QIIME (version 1.8) to analyze 26 soil samples from a diesel-contaminated railway site (Sutton et al. 2013). You will have an electronic copy of the paper with your training materials.
- We have randomly picked 5000 reads from the original Roche 454 dataset to speed up the analysis. We also provide a pre-computed analysis of the full dataset.
- QIIME is used in the EBI metagenomics pipeline with whole genome shotgun data. EBI metagenomics currently does not analyze amplicon data as standard. However, with the help of this tutorial you could soon be analyzing your own amplicon data sets.
- We will spend some time on the analysis of an Illumina shotgun dataset, a metagenome of a microbial consortium obtained from the Tuna oil field in the Gippsland Basin, Australia (Dongmei et al. 2013 and Sutcliffe et al. 2013).

OTU picking strategies in QIIME
- De novo (I will explain this strategy in more detail)
  - Use for amplicons that overlap
  - Use if you do not have a reference sequence collection
  - Clusters all reads without using a reference
  - Not very suitable for very large data sets (cannot be run in parallel)
- Closed-reference
  - Use if amplicons (or shotgun reads) do not overlap and you have a reference sequence collection
  - Note: reads that do not hit a reference sequence are discarded
- Open-reference
  - Use for amplicons that overlap
  - Reads are clustered against a reference sequence collection
  - Reads that do not match are clustered de novo

Common approaches: metagenomic analysis
- Identification of reads with 16S sequence (e.g. using rRNASelector) and closed-reference OTU picking in QIIME. We will analyze an artificially small Illumina dataset during the hands-on.
- Blast-based analysis, e.g. blasting reads against the NCBI non-redundant nucleotide or protein databases and inferring taxonomic lineage from the best hit. The tool MEGAN requires Blast output. A major drawback is that without preprocessing of NGS datasets and access to a major computational resource, this is not an option for most.
- MetaPhlAn approach (http://huttenhower.sph.harvard.edu/metaphlan): relies on unique clade-specific marker genes identified from 3,000 reference genomes. Fast, but limited to certain types of study (mainly human microbiome).

De novo OTU picking in detail
- We will now go through the de novo OTU picking steps in more detail and focus on the diesel-contaminated railway line study. We will perform the actual analysis during the hands-on session today.
- We will largely follow the QIIME 454 overview tutorial at http://qiime.org/tutorials/tutorial.html
- Aim of our study: understand the interrelationship among microbial community composition, pollution level, and soil geochemical and physical properties.
- Sequencing technology/chemistry: Roche 454 FLX Titanium
- Amplicon: V3 + V4 region of the 16S rRNA gene

Overview of the diesel-contaminated railway site
In 2010, 26 samples were taken from 9 locations (A to I) at different depths:
  Sample  Soil  Status
  A1      Fill  Polluted
  A2      Fill  Polluted
  B1      Fill  Clean
  B2      Clay  Polluted
  B3      Peat  Polluted
  B4      Peat  Polluted
  C1      Fill  Clean
  C2      Peat  Clean
  C3      Peat  Polluted
  D1      Fill  Clean
  D2      Clay  Clean
  D3      Clay  Polluted
  D4      Peat  Polluted
  D5      Sand  Polluted
  E1      Fill  Clean
  E2      Fill  Polluted
  F1      Sand  Clean
  F2      Sand  Polluted
  G1      Fill  Clean
  G2      Fill  Clean
  G3      Fill  Clean
  H1      Peat  Clean
  H2      Peat  Clean
  H3      Sand  Clean
  I1      Sand  Clean
  I2      Sand  Clean

The targeted 16S rRNA gene region
- The targeted region is a 466 bp fragment containing the 16S rRNA V3 and V4 hypervariable regions.
- Each sample has a sequence primer adapter and a 10-nucleotide barcode to allow multiplexing (sequencing all samples on the same plate, mainly to reduce sequencing cost).
- The sequence file is in Roche 454 SFF format.

The analysis in detail (1): file preparation
- The standard 454 data format is SFF. We need to extract the FASTA sequences and the quality scores into two separate files. We will use the tool sffinfo from Roche.
  Sequences (FASTA):
    >GW6RNWL02GKV5K length=463 xy=2581_0822 region=2 run=r_2011_02_04_06_15_22_
    ACATACGCGTCCTATGGGATGCAGCAGGCGCGAAAACTTTACAATGCCGGCAACGGCGAT
    >GW6RNWL02HFI7P length=418 xy=2930_0883 region=2 run=r_2011_02_04_06_15_22_
    ACATACGCGTCCTATGGGATGCAGCAGGCGCGAAAACTTTACAATGCTGGCAACAGCGAT...
    AAGGGAACCTCGAGTGCCAGGTTACAAATCTGGCTGTCGAGATGCCTAAAAAGCATTTCA...
  Quality scores:
    >GW6RNWL02GKV5K length=463 xy=2581_0822 region=2 run=r_2011_02_04_06_15_22_
    40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 39 39 39 40 40 40 40 40 40 40 40 40 40 40 40
    40 40 29 29 29 29 40 39 39 39 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40...
    >GW6RNWL02HFI7P length=418 xy=2930_0883 region=2 run=r_2011_02_04_06_15_22_
    40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 39 39 39 40 40 40 40 40 40 40 40 40 40 40 40
    40 38 21 21 21 21 38 39 39 39 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40...
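To make the two-file layout concrete, here is a minimal Python sketch that reads a FASTA file and its matching quality file and checks that each read has one quality score per base. It is an illustration only, not what sffinfo or QIIME do internally; the function names and the toy records are hypothetical.

```python
from io import StringIO


def read_records(handle):
    """Yield (read_id, payload_lines) for each '>' record in a FASTA-style file."""
    read_id, lines = None, []
    for raw in handle:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            if read_id is not None:
                yield read_id, lines
            # The read ID is the first whitespace-separated token of the header.
            read_id, lines = line[1:].split()[0], []
        else:
            lines.append(line)
    if read_id is not None:
        yield read_id, lines


def load_fasta(handle):
    """Return {read_id: sequence}, joining wrapped sequence lines."""
    return {rid: "".join(lines) for rid, lines in read_records(handle)}


def load_qual(handle):
    """Return {read_id: [int scores]}, joining wrapped score lines."""
    return {rid: [int(q) for line in lines for q in line.split()]
            for rid, lines in read_records(handle)}


# Toy data (hypothetical reads, much shorter than real 454 reads).
FASTA_TEXT = ">read1 length=8\nACGT\nACGT\n>read2 length=4\nTTGA\n"
QUAL_TEXT = ">read1 length=8\n40 40 40 39\n39 40 29 29\n>read2 length=4\n38 21 21 40\n"

seqs = load_fasta(StringIO(FASTA_TEXT))
quals = load_qual(StringIO(QUAL_TEXT))
for rid, seq in seqs.items():
    # Each base must have exactly one quality score.
    assert len(seq) == len(quals[rid])
    print(rid, len(seq), sum(quals[rid]) / len(quals[rid]))
```

With real data you would pass open file handles for the extracted FASTA and quality files instead of the StringIO objects.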

The analysis in detail (2): assign reads to samples using barcode information and perform some quality control
- We need to provide a tab-delimited mapping file that provides, at a minimum, the name of each sample, the barcode to identify the different samples, the linker/primer sequence used to amplify the DNA, and a description of the sample:
    #SampleID  BarcodeSequence  LinkerPrimerSequence  Description
    A1         ACATACGCGT       CCTAYGGGRBGCASCAG     A1_Fill_Polluted
    A2         ACGCGAGTAT       CCTAYGGGRBGCASCAG     A2_Fill_Polluted
    B1         ACTACTATGT       CCTAYGGGRBGCASCAG     B1_Fill_Clean
    etc.
- For example, sequence reads that have the sequence ACATACGCGT near the start will be assigned to sample A1.
- The procedure we use will rename headers in the FASTA and quality files accordingly. It also removes the barcode and primer sequences from the reads, as these interfere with the OTU picking.
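The barcode lookup and primer removal described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: the barcode is taken to be exactly the first 10 bases, matching is exact (no error correction, no quality filtering), and the function names are hypothetical rather than part of QIIME. The primer contains IUPAC ambiguity codes (Y, R, B, S), so it is expanded into a regular expression.

```python
import re

# IUPAC nucleotide ambiguity codes expanded to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}

MAPPING_TEXT = "\n".join([
    "#SampleID\tBarcodeSequence\tLinkerPrimerSequence\tDescription",
    "A1\tACATACGCGT\tCCTAYGGGRBGCASCAG\tA1_Fill_Polluted",
    "A2\tACGCGAGTAT\tCCTAYGGGRBGCASCAG\tA2_Fill_Polluted",
    "B1\tACTACTATGT\tCCTAYGGGRBGCASCAG\tB1_Fill_Clean",
])


def parse_mapping(text):
    """Return {barcode: (sample_id, primer)} from tab-delimited mapping text."""
    mapping = {}
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        sample_id, barcode, primer = line.split("\t")[:3]
        mapping[barcode] = (sample_id, primer)
    return mapping


def demultiplex(read, mapping, barcode_len=10):
    """Return (sample_id, trimmed_read), or None if barcode or primer do not match."""
    barcode = read[:barcode_len]
    if barcode not in mapping:
        return None
    sample_id, primer = mapping[barcode]
    pattern = re.compile("".join(IUPAC[base] for base in primer))
    # The primer must start immediately after the barcode.
    match = pattern.match(read, barcode_len)
    if match is None:
        return None
    return sample_id, read[match.end():]


mapping = parse_mapping(MAPPING_TEXT)
read = "ACATACGCGTCCTATGGGATGCAGCAGGCGCGAAAACTTTACAATGCCGGCAACGGCGAT"
print(demultiplex(read, mapping))
```

The example read is the start of the first sequence shown on the file preparation slide; its first 10 bases are the A1 barcode.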

Optional: Denoising 454 data (flowgram clustering)
- A small number of reads from Roche 454 pyrosequencing runs have characteristic errors when longer homopolymer runs are present. These reads give rise to erroneous OTUs.
- A procedure called denoising or flowgram clustering removes problematic reads and increases the accuracy of the taxonomic analysis.
- Denoising is computationally expensive and we will therefore skip this procedure in the hands-on.
- If you work with 454 amplicon data and your file uses the older regular flow pattern, consider denoising. See http://qiime.org/tutorials/denoising_454_data.html. Read the warning about the new random flow patterns.
- Remember that denoising does not make sense with shotgun data.

The analysis in detail (3): pick Operational Taxonomic Units
- OTUs are collections of sequences that are highly similar (here 97% or more). Taxonomic assignments are done on these OTUs. We will perform de novo OTU picking.
- The QIIME workflow will produce a number of output files:
  - A list of OTUs with taxonomic assignments with the hierarchy: kingdom, phylum, class, order, family, genus, species. Most OTUs cannot be classified down to species level. E.g.:
      denovo745 k__Bacteria; p__Proteobacteria; c__Alphaproteobacteria; o__Rhizobiales; f__Rhizobiaceae; g__Agrobacterium; s__ 1.00 3
  - A representation of a taxonomic tree in Newick format. The tree can be visualized in applications like FigTree.
  - A file in BIOM (Biological Observation Matrix) format representing the OTU table. We will import this file into MEGAN 5 to visualize our results.
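As an illustration of how such a lineage can be read programmatically, the sketch below splits a lineage string into ranks and reports the deepest rank that was actually assigned. It assumes Greengenes-style lineage strings in which each rank is a one-letter prefix followed by two underscores (k__, p__, and so on); the function names are hypothetical.

```python
# One-letter rank prefixes in hierarchical order.
RANKS = {
    "k": "kingdom", "p": "phylum", "c": "class", "o": "order",
    "f": "family", "g": "genus", "s": "species",
}


def parse_lineage(lineage):
    """Map rank names to taxon names; unassigned ranks (e.g. 's__') become None."""
    parsed = {}
    for field in lineage.split(";"):
        prefix, _, name = field.strip().partition("__")
        parsed[RANKS[prefix]] = name or None
    return parsed


def deepest_rank(parsed):
    """Return the most specific rank that has a name, or None if nothing is assigned."""
    deepest = None
    for rank in RANKS.values():
        if parsed.get(rank):
            deepest = rank
    return deepest


lineage = ("k__Bacteria; p__Proteobacteria; c__Alphaproteobacteria; "
           "o__Rhizobiales; f__Rhizobiaceae; g__Agrobacterium; s__")
parsed = parse_lineage(lineage)
print(deepest_rank(parsed), parsed["genus"])
```

For the example OTU the species field is empty, so the classification stops at genus level.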

De novo OTU picking in detail (1)
- Generate OTUs by clustering reads based on similarity (default is 97%).
- Sort reads according to size (long -> short), then cluster.
[Figure: sorted reads grouped into clusters OTU1 to OTU5]
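The sort-then-cluster idea can be sketched as a greedy procedure: reads are processed from longest to shortest, and each read either joins the first existing cluster whose seed it matches at the threshold or starts a new cluster. This is a toy illustration, not the algorithm QIIME runs; in particular the identity function below simply compares ungapped positions over the shorter sequence, which is an assumption made to keep the example short (real tools align the sequences).

```python
def identity(a, b):
    """Fraction of matching positions over the shorter sequence (no gaps allowed)."""
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / n


def greedy_cluster(reads, threshold=0.97):
    """Cluster reads greedily, longest first; returns a list of (seed, members)."""
    clusters = []
    for read in sorted(reads, key=len, reverse=True):
        for seed, members in clusters:
            if identity(seed, read) >= threshold:
                members.append(read)
                break
        else:
            # No existing seed is similar enough: this read seeds a new OTU.
            clusters.append((read, [read]))
    return clusters


# Toy reads (hypothetical): r2 differs from r1 at 2 of 100 positions.
r1 = "ACGT" * 25
r2 = r1[:98] + "CA"
r3 = "T" * 90
clusters = greedy_cluster([r3, r1, r2])
for number, (seed, members) in enumerate(clusters, start=1):
    print("OTU%d" % number, len(seed), len(members))
```

Because the outcome depends on the order in which reads are processed, sorting first makes the clustering reproducible.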

De novo OTU picking in detail (2)
- Pick a representative sequence for each OTU.
- Assign taxonomy to each OTU.
[Figure: the representative sequences of OTU1 to OTU5 are compared with the reference database, giving lineage 1 to lineage 5]

De novo OTU picking in detail (3)
- Align OTU sequences (if you want to do further phylogenetic analysis).
- Optional: remove chimaeras from your alignment.
- Filter alignment.
- Create tree file in Newick format.
- Create OTU table in BIOM format.
- We can now visualize the results and do further analysis, such as alpha-diversity analysis (diversity within a sample) and beta-diversity analysis (diversity across samples).
- We will first have a quick look at MEGAN 5, a tool we will use to visualize our results.

A quick look at MEGAN 5
- MEGAN stands for MEtaGenome ANalyzer and was written to help understand the composition and operation of complex microbial consortia.
- It is free for academic users and can be downloaded from http://ab.inf.uni-tuebingen.de/software/megan5/.
- In order to use MEGAN for both functional analysis and taxonomic analysis, a Blast step needs to be performed whereby a metagenomic dataset is Blast-ed against e.g. one of NCBI's non-redundant nucleotide or protein databases. This step is extremely computationally expensive and not an option for many users.
- Recently, support for the BIOM format was added, which allows us to visualize and analyze taxonomic analysis results from QIIME. Select import BIOM from the File menu.

Taxonomic tree display in MEGAN 5
[Figure]

Rarefaction curves in MEGAN 5
[Figure]

Taxonomic composition of samples in MEGAN 5
[Figure]

Selecting 16S rDNA sequence with rRNASelector from shotgun data and closed-reference OTU picking with QIIME
- Amplicon studies offer insight into taxonomic diversity of samples, but they cannot be used to study function (or coding potential). Instead we need shotgun data.
- In an ideal world, to get the most out of your physical samples you'd prepare multiple libraries (amplicon, metagenomic, transcriptomic). In practice most people don't.
- It is possible to get taxonomic information out of shotgun data. We'll discuss how we have approached this at the EBI:
  - rRNASelector (1): select reads with rDNA
  - rRNASelector (2): remove non-rDNA sequence, keeping the rDNA sequence

Closed-reference OTU picking
- The set of clipped rDNA reads obtained with rRNASelector is clustered against a reference database.
[Figure: reads are matched with uclust against the 16S rDNA reference set; one read is marked with an X]
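The logic of closed-reference picking can be sketched as follows: each read is compared with every reference sequence, counted towards its best match if that match reaches the similarity threshold, and discarded otherwise. This is a toy illustration and not how uclust works internally; the ungapped identity function, the 97% default, the reference IDs and the function names are all assumptions made for the example.

```python
def identity(a, b):
    """Fraction of matching positions over the shorter sequence (no gaps allowed)."""
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / n


def closed_reference(reads, references, threshold=0.97):
    """Return ({reference_id: read_count}, [discarded reads])."""
    otu_counts, discarded = {}, []
    for read in reads:
        best_id, best_score = None, 0.0
        for ref_id, ref_seq in references.items():
            score = identity(read, ref_seq)
            if score > best_score:
                best_id, best_score = ref_id, score
        if best_id is not None and best_score >= threshold:
            otu_counts[best_id] = otu_counts.get(best_id, 0) + 1
        else:
            # Reads without a good enough reference hit are dropped.
            discarded.append(read)
    return otu_counts, discarded


# Toy reference set and reads (hypothetical sequences).
references = {"ref1": "ACGT" * 25, "ref2": "GGCC" * 25}
reads = ["ACGT" * 25, "ACGT" * 20, "T" * 50]
otu_counts, discarded = closed_reference(reads, references)
print(otu_counts, len(discarded))
```

Since every read is scored independently of the others, this strategy is easy to split over many processors, unlike de novo clustering.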

Further phylogenetic analyses: taxa summary
- We can visualize the taxonomic composition of our samples. We will reproduce this figure during the hands-on session.
- We are looking at the composition at phylum level. A legend is also produced (not shown).

Further phylogenetic analyses: alpha diversity and rarefaction curves
- Alpha diversity looks at the species diversity within samples.
- If you produced more sequence from your sample, you would expect the number of species to increase until a point where producing more sequence does not significantly increase the number of observed species.
- You can perform rarefaction analysis on your sample to find out whether you have sequenced at sufficient depth.
- Rarefaction analysis involves repeated in silico subsampling of your data at increasing depths. For example, if your sample consists of 1000 sequences, you could randomly sample 100 reads (with e.g. 10 repetitions), then 200, 300, etc. You can then plot the number of observed species against the subsample size. If the curves flatten, you have sequenced at sufficient depth.
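The subsampling procedure can be sketched in Python as below. The sketch assumes that each read has already been assigned to an OTU, so a sample is simply a list of OTU labels, one per read; the function name, the fixed random seed and the toy sample are hypothetical.

```python
import random


def rarefaction_curve(otu_labels, depths, repetitions=10, seed=0):
    """Return {depth: mean number of distinct OTUs over repeated subsamples}."""
    rng = random.Random(seed)
    curve = {}
    for depth in depths:
        if depth > len(otu_labels):
            continue  # cannot draw more reads than the sample contains
        observed = [
            len(set(rng.sample(otu_labels, depth)))  # sampling without replacement
            for _ in range(repetitions)
        ]
        curve[depth] = sum(observed) / repetitions
    return curve


# Toy sample of 1000 reads: two abundant OTUs and 20 rare ones (10 reads each).
sample = ["otu_a"] * 500 + ["otu_b"] * 300
for i in range(20):
    sample += ["rare_%d" % i] * 10

curve = rarefaction_curve(sample, depths=range(100, 1001, 100))
for depth, mean_otus in sorted(curve.items()):
    print(depth, mean_otus)
```

Plotting the printed values with depth on the x-axis gives the rarefaction curve for this sample.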

Divergence measurements between organisms
- Divergence-based diversity measures estimate the degree to which pairs of organisms differ:
  - Sequence distance: measure of sequence identity.
  - Phylogenetic distance: sum of branch lengths that separate two organisms in a phylogenetic tree (see fig A).
  - Topological distance: as phylogenetic distance, but all branch lengths set the same (usually 1).
  - Taxonomic distance: taxonomic level separating two organisms (e.g. same species = 1, same genus = 2, same family = 3, etc).
- Usually, where sequence data is available (e.g. 16S rRNA), sequence or phylogenetic distance measurements are most powerful.
- If phylogenetic trees with meaningful branch lengths are not available, but taxonomic relationships are well defined, topological or taxonomic distance measures can be used (most commonly used for macroorganisms).
[Figure caption: PD for grey is the sum of the grey branches]
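To illustrate the difference between phylogenetic and topological distance, the sketch below computes both on a small rooted tree. The tree, its branch lengths and the function names are hypothetical; the tree is stored as a mapping from each node to its parent and the length of the branch leading to it.

```python
# Toy tree (hypothetical): child -> (parent, branch_length). The root has no entry.
TREE = {
    "n1": ("root", 0.1),
    "A": ("n1", 0.2),
    "B": ("n1", 0.3),
    "C": ("root", 0.5),
}


def path_to_root(tree, node):
    """Return {node: branch_length} for every branch between `node` and the root."""
    path = {}
    while node in tree:
        parent, length = tree[node]
        path[node] = length
        node = parent
    return path


def tree_distance(tree, a, b, topological=False):
    """Sum of the branch lengths separating a and b (each branch counts 1 if topological)."""
    path_a = path_to_root(tree, a)
    path_b = path_to_root(tree, b)
    # Branches on both paths lie above the common ancestor and do not separate a and b.
    separating = set(path_a) ^ set(path_b)
    total = 0.0
    for node in separating:
        length = path_a[node] if node in path_a else path_b[node]
        total += 1.0 if topological else length
    return total


print(tree_distance(TREE, "A", "B"), tree_distance(TREE, "A", "C"))
print(tree_distance(TREE, "A", "B", topological=True),
      tree_distance(TREE, "A", "C", topological=True))
```

A and B are separated by two branches, A and C by three, so the topological distances differ from the phylogenetic ones whenever branch lengths are unequal.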

Measures of alpha diversity
- A community that contains taxa that are more divergent from each other is more diverse.
- There are many ways to measure alpha diversity. A few examples:
  - Phylogenetic Diversity (PD): the sum of the branch lengths in a phylogenetic tree that lead to the members of a community. Qualitative measure of divergence.
  - Theta: the average divergence between two randomly chosen sequences (individuals). Quantitative, as it accounts for both evenness and divergence between taxa (low evenness: numerical dominance of a few species).
  - Chao 1: species-based qualitative measure.
  - Shannon: species-based quantitative measure.
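As a worked example of the two species-based measures, the sketch below computes the Shannon index and a Chao 1 estimate from a list of per-OTU read counts. Two choices here are assumptions rather than something stated on the slide: the Shannon index uses the natural logarithm (other tools may use base 2), and Chao 1 is given in its bias-corrected form, S_obs + F1(F1 - 1) / (2(F2 + 1)), where F1 and F2 are the numbers of OTUs seen exactly once and twice.

```python
import math


def shannon(counts):
    """Shannon index H = -sum(p_i * ln(p_i)) over OTUs with non-zero counts."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


def chao1(counts):
    """Bias-corrected Chao 1 richness estimate from per-OTU read counts."""
    observed = sum(1 for c in counts if c > 0)
    singletons = sum(1 for c in counts if c == 1)
    doubletons = sum(1 for c in counts if c == 2)
    return observed + singletons * (singletons - 1) / (2 * (doubletons + 1))


# Toy communities (hypothetical counts): same richness, different evenness.
even = [25, 25, 25, 25]
uneven = [97, 1, 1, 1]
print(shannon(even), shannon(uneven))
print(chao1(even), chao1(uneven))
```

The even community has the higher Shannon index, while Chao 1 reacts to the singletons in the uneven community by estimating that more OTUs remain unseen.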

Further phylogenetic analyses: beta diversity
- Beta diversity analysis compares diversity between the samples in your study. We calculate the distance between a pair of samples and we do this for all pairs of samples.
- We obtain a distance matrix that we can visualize in a number of ways, e.g. as a tree, a network or a principal coordinates analysis (PCoA) plot.
- During the hands-on we will generate PCoA plots to visualize the distances between our samples in 3-dimensional space. We'll have a separate tutorial on visualization with Emperor.
- As our samples show variation in sequencing depth, we will use the number of reads from the smallest sample as our sequencing depth and rarefy all other samples to this depth.
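The two steps above (rarefy every sample to the depth of the smallest one, then compute all pairwise distances) can be sketched as follows. Note the swapped-in metric: the example uses Bray-Curtis dissimilarity purely because it needs no tree, whereas the hands-on uses UniFrac. Sample names, counts and function names are hypothetical.

```python
import random
from collections import Counter
from itertools import combinations


def rarefy(counts, depth, rng):
    """Draw `depth` reads without replacement from an {otu: count} table."""
    pool = [otu for otu, n in counts.items() for _ in range(n)]
    return Counter(rng.sample(pool, depth))


def bray_curtis(a, b):
    """Bray-Curtis dissimilarity between two {otu: count} tables (0 = identical)."""
    shared = sum(min(a.get(otu, 0), b.get(otu, 0)) for otu in set(a) | set(b))
    return 1 - 2 * shared / (sum(a.values()) + sum(b.values()))


def distance_matrix(samples, seed=0):
    """Rarefy all samples to the smallest depth, then compare every pair."""
    rng = random.Random(seed)
    depth = min(sum(counts.values()) for counts in samples.values())
    rarefied = {name: rarefy(counts, depth, rng) for name, counts in samples.items()}
    return {
        (x, y): bray_curtis(rarefied[x], rarefied[y])
        for x, y in combinations(sorted(rarefied), 2)
    }


# Toy OTU tables (hypothetical): S2 is the smallest sample with 30 reads.
samples = {
    "S1": {"otu1": 50, "otu2": 50},
    "S2": {"otu1": 30},
    "S3": {"otu3": 40},
}
matrix = distance_matrix(samples)
for pair, distance in sorted(matrix.items()):
    print(pair, round(distance, 3))
```

The resulting matrix is the input for a tree, a network or a PCoA plot.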

Measures of community distance: UniFrac
- There are many ways to measure beta diversity (see e.g. Lozupone and Knight, 2009, for a summary).
- Divergence-based measures: communities are considered more related if the taxa they contain are more closely related.
  - UniFrac (qualitative): measures phylogenetic distance between sets of taxa in a tree.
  - Weighted UniFrac (quantitative): variation of UniFrac that accounts for changes in relative abundance of lineages between communities.
  - Quantitative measures depend on accurate information on the relative abundance of sequences (which could be biased by lab procedures).
- UniFrac allows you to:
  - Determine if the environments in the input phylogenetic tree have significantly different microbial communities.
  - Determine if community differences are concentrated within particular lineages of the phylogenetic tree.
  - Cluster environments to determine whether there are environmental factors (such as temperature or salinity) that group communities together.
  - Determine whether the environments were sampled sufficiently to support cluster nodes.
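The qualitative (unweighted) form of UniFrac is commonly described as the fraction of the tree's branch length that leads to members of only one of the two communities. The sketch below implements that definition on a toy tree; the tree, its branch lengths, the OTU names and the function name are hypothetical, and each branch is stored together with the set of tips that descend from it.

```python
# Toy tree (hypothetical): one entry per branch as (length, tips below that branch).
BRANCHES = [
    (0.2, {"otu1"}),
    (0.3, {"otu2"}),
    (0.1, {"otu1", "otu2"}),  # internal branch above otu1 and otu2
    (0.5, {"otu3"}),
]


def unweighted_unifrac(branches, community_a, community_b):
    """Unique branch length divided by total branch length covered by either community."""
    unique = shared = 0.0
    for length, tips in branches:
        in_a = bool(tips & community_a)
        in_b = bool(tips & community_b)
        if in_a and in_b:
            shared += length
        elif in_a or in_b:
            unique += length
    # Assumes at least one branch is covered; otherwise this divides by zero.
    return unique / (unique + shared)


a = {"otu1", "otu2"}
b = {"otu2", "otu3"}
print(unweighted_unifrac(BRANCHES, a, b))
```

Only presence or absence of each OTU enters the calculation, which is why this variant is called qualitative; the weighted variant additionally scales branches by relative abundance.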

QIIME analysis of Illumina amplicon data
- Data preparation differs from 454 analysis.
- Closed-reference OTU picking can be parallelized and is therefore preferred.
- For demultiplexing you need a mapping file (as discussed for 454), the fastq file containing the barcode sequences and the fastq file containing the reads. It is also possible to demultiplex samples if your data is from multiple lanes.
- For details see the following QIIME tutorial: http://qiime.org/tutorials/processing_illumina_data.html
- Note: for a full HiSeq2000 run, this process can take up to 500 CPU hours!

Finally
- This concludes the introduction to taxonomic analysis with QIIME.
- If taxonomic analysis is important to your work, then do spend time going through the different QIIME tutorials at http://qiime.org/tutorials/.
Thank you
