Introduction to taxonomic analysis of metagenomic amplicon and shotgun data with QIIME. Peter Sterk EBI Metagenomics Course 2014

Taxonomic analysis using next-generation sequencing
Objective: we want to obtain samples from a particular environment to find out what lives in it.
Know your sample.
What kind of samples do we have? Soil, water, host-associated (e.g. gut), etc.
What do we expect to find in those samples? Prokaryotes, eukaryotic microorganisms (e.g. protists, fungi), viruses?
Decide what you want to find out, e.g. bacteria/archaea populations only, or all microbes (including eukaryotic ones).
Design your experiment around that.

Some terminology
Amplicon: a DNA fragment that is amplified with PCR, e.g. one or more 16S rRNA variable regions, or other marker genes. Most researchers make use of standard PCR primers.
Clustering: grouping sequences into bins (or clusters) based on a percent similarity threshold.
Operational Taxonomic Unit (OTU): an operational stand-in for species in microbiology, typically defined using rDNA and a percent similarity threshold that places microbes in the same, or in different, OTUs. Note that an OTU is not the same as a species. For bacteria/archaea, OTUs are typically clusters of reads that are at least 97% identical.
Barcode: a short DNA sequence that is added to each read during amplification and that is specific for a given sample. This allows samples to be mixed (multiplexed) to reduce sequencing cost. During analysis, sequences need to be demultiplexed, i.e. separated by sample.

Common approaches: amplicon-based
Sequencing of (regions of) target genes (amplicons) obtained by PCR using gene-specific primers. For bacteria/archaea, the target is usually a 16S rRNA gene fragment containing one or more variable regions; for fungi, the internal transcribed spacer (ITS); for eukaryotes, 18S rRNA gene fragments.
Analysis usually requires a reference database that is searched to find the closest match to an OTU, from which a taxonomic lineage is inferred. Some examples:
Greengenes (http://greengenes.lbl.gov): 16S
Ribosomal Database Project (http://rdp.cme.msu.edu): 16S
Silva (http://www.arb-silva.de): 16S + 18S
Unite (http://unite.ut.ee): ITS
The amplicon approach is less suitable for certain groups of organisms such as protists: these are extremely diverse and only a few have sequence information. The same goes for viruses.
We will mainly focus on 16S analysis during the hands-on as this is most common, but you must decide whether this is suitable for your work. We will also spend a little time on taxonomic analysis of Illumina shotgun data.

Hands-on QIIME tutorial
QIIME is an open source software package for comparison and analysis of microbial communities, primarily based on high-throughput amplicon sequencing data (such as SSU rRNA) generated on a variety of platforms. It is widely used and supported.
We will use the latest version of QIIME (Quantitative Insights Into Microbial Ecology, pronounced "chime"; qiime.org; version 1.8) to analyze 26 soil samples from a diesel-contaminated railway site (Sutton et al. 2013). You will have an electronic copy of the paper with your training materials.
We have randomly picked 5000 reads from the original Roche 454 dataset to speed up the analysis. We also provide a pre-computed analysis of the full dataset.
QIIME is used in the EBI metagenomics pipeline with whole genome shotgun data. EBI metagenomics currently does not analyze amplicon data as standard. However, with the help of this tutorial you could soon be analyzing your own amplicon data sets.
We will spend some time on the analysis of an Illumina shotgun dataset, a metagenome of a microbial consortium obtained from the Tuna oil field in the Gippsland Basin, Australia (Dongmei et al. 2013 and Sutcliffe et al. 2013).

OTU picking strategies in QIIME
De novo
  Use for amplicons that overlap.
  Use if you do not have a reference sequence collection.
  Clusters all reads without using a reference.
  Not very suitable for very large data sets (cannot be run in parallel).
  (I will explain this strategy in more detail.)
Closed-reference
  Use if amplicons (or shotgun reads) do not overlap and you have a reference sequence collection.
  Note: reads that do not hit a reference sequence are discarded.
Open-reference
  Use for amplicons that overlap.
  Reads are clustered against a reference sequence collection.
  Reads that do not match are clustered de novo.
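The difference between closed-reference and open-reference picking can be illustrated with a minimal Python sketch. This is not QIIME's implementation: the `identity` function is a crude ungapped stand-in for the alignment-based similarity a real clustering tool computes, and the function names, the toy sequences and the 0.9 threshold are all assumptions made for illustration.

```python
def identity(a, b):
    """Fraction of matching positions over the shorter sequence (ungapped stand-in)."""
    n = min(len(a), len(b))
    return sum(x == y for x, y in zip(a, b)) / n if n else 0.0

def closed_reference(reads, reference, threshold=0.97):
    """Assign each read to its best-matching reference; return (otus, unmatched reads)."""
    otus, unmatched = {}, []
    for read in reads:
        best_id, best_score = None, 0.0
        for ref_id, ref_seq in reference.items():
            score = identity(read, ref_seq)
            if score > best_score:
                best_id, best_score = ref_id, score
        if best_id is not None and best_score >= threshold:
            otus.setdefault(best_id, []).append(read)
        else:
            unmatched.append(read)  # closed-reference picking discards these
    return otus, unmatched

def open_reference(reads, reference, threshold=0.97):
    """Closed-reference step first, then cluster the unmatched reads de novo."""
    otus, unmatched = closed_reference(reads, reference, threshold)
    seeds = []
    for read in sorted(unmatched, key=len, reverse=True):
        for i, seed in enumerate(seeds):
            if identity(seed, read) >= threshold:
                otus[f"denovo{i}"].append(read)
                break
        else:
            otus[f"denovo{len(seeds)}"] = [read]
            seeds.append(read)
    return otus

# Toy, made-up data; a 0.9 threshold is used only because the sequences are short.
reference = {"ref1": "ACGTACGTAC"}
reads = ["ACGTACGTAC", "ACGTACGTAA", "TTTTGGGGCC", "TTTTGGGGCA"]
closed_otus, discarded = closed_reference(reads, reference, threshold=0.9)
open_otus = open_reference(reads, reference, threshold=0.9)
print(closed_otus, discarded)
print(open_otus)
```

In the closed-reference call the two reads without a reference hit are simply lost, whereas the open-reference call keeps them as a new de novo cluster.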

Common approaches: metagenomic analysis
Identification of reads with 16S sequence (e.g. using rRNASelector) and closed-reference OTU picking in QIIME. We will analyze an artificially small Illumina dataset during the hands-on.
Blast-based analysis, e.g. blasting reads against the NCBI non-redundant nucleotide or protein databases and inferring taxonomic lineage from the best hit. The tool MEGAN requires Blast output. A major drawback is that without preprocessing of NGS datasets and access to a major computational resource, this is not an option for most.
MetaPhlAn approach (http://huttenhower.sph.harvard.edu/metaphlan): relies on unique clade-specific marker genes identified from 3,000 reference genomes. Fast, but limited to certain types of study (mainly human microbiome).

De novo OTU picking in detail
We will now go through the de novo OTU picking steps in more detail and focus on the diesel-contaminated railway line study. We will perform the actual analysis during the hands-on session today. We will largely follow the QIIME 454 overview tutorial at http://qiime.org/tutorials/tutorial.html
Aim of our study: understand the interrelationship among microbial community composition, pollution level, and soil geochemical and physical properties.
Sequencing technology/chemistry: Roche 454 FLX Titanium
Amplicon: V3 + V4 region of the 16S rRNA gene

Overview of the diesel-contaminated railway site
In 2010, 26 samples were taken from 9 locations at different depths:
Sample  Soil type  Status
A1      Fill       Polluted
A2      Fill       Polluted
B1      Fill       Clean
B2      Clay       Polluted
B3      Peat       Polluted
B4      Peat       Polluted
C1      Fill       Clean
C2      Peat       Clean
C3      Peat       Polluted
D1      Fill       Clean
D2      Clay       Clean
D3      Clay       Polluted
D4      Peat       Polluted
D5      Sand       Polluted
E1      Fill       Clean
E2      Fill       Polluted
F1      Sand       Clean
F2      Sand       Polluted
G1      Fill       Clean
G2      Fill       Clean
G3      Fill       Clean
H1      Peat       Clean
H2      Peat       Clean
H3      Sand       Clean
I1      Sand       Clean
I2      Sand       Clean

The targeted 16S rRNA gene region
The targeted region is a 466 bp fragment containing the 16S rRNA V3 and V4 hypervariable regions.
Each sample has a sequence primer adapter and a 10-nucleotide barcode to allow multiplexing (sequencing all samples on the same plate, mainly to reduce sequencing cost).
The sequence file is in Roche 454 SFF format.

The analysis in detail (1)
File preparation. The standard 454 data format is SFF. We need to extract the FASTA sequences and quality scores into two separate files. We will use the tool sffinfo from Roche.
Sequence file:
>GW6RNWL02GKV5K length=463 xy=2581_0822 region=2 run=r_2011_02_04_06_15_22_
ACATACGCGTCCTATGGGATGCAGCAGGCGCGAAAACTTTACAATGCCGGCAACGGCGAT
>GW6RNWL02HFI7P length=418 xy=2930_0883 region=2 run=r_2011_02_04_06_15_22_
ACATACGCGTCCTATGGGATGCAGCAGGCGCGAAAACTTTACAATGCTGGCAACAGCGAT...
AAGGGAACCTCGAGTGCCAGGTTACAAATCTGGCTGTCGAGATGCCTAAAAAGCATTTCA...
Quality file:
>GW6RNWL02GKV5K length=463 xy=2581_0822 region=2 run=r_2011_02_04_06_15_22_
40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 39 39 39 40 40 40 40 40 40 40 40 40 40 40 40 40 40 29 29 29 29 40 39 39 39 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40...
>GW6RNWL02HFI7P length=418 xy=2930_0883 region=2 run=r_2011_02_04_06_15_22_
40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 39 39 39 40 40 40 40 40 40 40 40 40 40 40 40 40 38 21 21 21 21 38 39 39 39 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40...
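To make the two-file layout concrete, here is a small Python sketch that reads a FASTA file and its matching quality file and checks that every read has one quality score per base. It only illustrates the file formats; it does not reproduce sffinfo, and the toy records and helper names are made up for the example.

```python
from io import StringIO

def read_records(handle):
    """Yield (header, body_lines) for each '>' record in a FASTA-like file."""
    header, body = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, body
            header, body = line[1:], []
        else:
            body.append(line)
    if header is not None:
        yield header, body

def read_fasta(handle):
    """Map read id (first word of the header) to its sequence."""
    return {h.split()[0]: "".join(b) for h, b in read_records(handle)}

def read_qual(handle):
    """Map read id to its list of integer quality scores."""
    return {h.split()[0]: [int(q) for q in " ".join(b).split()]
            for h, b in read_records(handle)}

# Toy, made-up input standing in for the extracted sequence and quality files.
fasta = StringIO(">read1 length=8\nACGT\nACGT\n>read2 length=4\nGGCC\n")
qual = StringIO(">read1 length=8\n40 40 39 39\n40 40 21 21\n>read2 length=4\n40 38 38 40\n")

seqs = read_fasta(fasta)
quals = read_qual(qual)
for read_id, seq in seqs.items():
    scores = quals[read_id]
    assert len(seq) == len(scores)  # one quality score per base
    print(read_id, seq, round(sum(scores) / len(scores), 1))
```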

The analysis in detail (2)
Assign reads to samples using barcode information and perform some quality control.
We need to provide a tab-delimited mapping file that provides at a minimum the name of each sample, the barcode to identify the different samples, the linker/primer sequence used to amplify the DNA, and a description of the sample:
#SampleID  BarcodeSequence  LinkerPrimerSequence  Description
A1         ACATACGCGT       CCTAYGGGRBGCASCAG     A1_Fill_Polluted
A2         ACGCGAGTAT       CCTAYGGGRBGCASCAG     A2_Fill_Polluted
B1         ACTACTATGT       CCTAYGGGRBGCASCAG     B1_Fill_Clean
etc.
For example, sequence reads that have the sequence ACATACGCGT near the start will be assigned to sample A1. The procedure we use will rename headers in the FASTA and quality files accordingly. It also removes the barcode and primer sequences from the reads, as these interfere with the OTU picking.
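The barcode lookup and primer removal can be sketched in a few lines of Python. This is a simplified illustration, not the QIIME procedure: it requires an exact barcode match at the very start of the read, does no quality filtering, and the function names are hypothetical. The barcodes and the degenerate primer come from the mapping file above; the read is the start of the first example sequence shown earlier.

```python
import re

# IUPAC nucleotide codes expanded to regular-expression character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}

def primer_regex(primer):
    """Compile a degenerate primer into a regular expression."""
    return re.compile("".join(IUPAC[base] for base in primer))

def demultiplex(read, barcode_to_sample, primer):
    """Return (sample, trimmed_read), or (None, None) if no barcode/primer match."""
    pattern = primer_regex(primer)
    for barcode, sample in barcode_to_sample.items():
        if read.startswith(barcode):
            match = pattern.match(read, len(barcode))  # primer must follow the barcode
            if match:
                return sample, read[match.end():]
    return None, None

barcodes = {"ACATACGCGT": "A1", "ACGCGAGTAT": "A2", "ACTACTATGT": "B1"}
primer = "CCTAYGGGRBGCASCAG"
read = "ACATACGCGTCCTATGGGATGCAGCAGGCGCGAAAACTTTACAATGCCGGCAACGGCGAT"

sample, trimmed = demultiplex(read, barcodes, primer)
print(sample, trimmed)
```

The read is assigned to sample A1 and loses its first 27 bases (10 barcode bases plus 17 primer bases) before OTU picking.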

Optional: Denoising 454 data (flowgram clustering)
A small number of reads from Roche 454 pyrosequencing runs have characteristic errors when longer homopolymer runs are present. These reads give rise to erroneous OTUs.
A procedure called denoising or flowgram clustering removes problematic reads and increases the accuracy of the taxonomic analysis.
Denoising is computationally expensive and we will therefore skip this procedure in the hands-on.
If you work with 454 amplicon data and your file uses the older regular flow pattern, consider denoising. See http://qiime.org/tutorials/denoising_454_data.html. Read the warning about the new random flow patterns.
Remember that denoising does not make sense with shotgun data.

The analysis in detail (3)
Pick Operational Taxonomic Units. These are collections of sequences that are highly similar (here 97% or more). Taxonomic assignments are done on these OTUs. We will perform de novo OTU picking.
The QIIME workflow will produce a number of output files:
A list of OTUs with taxonomic assignments following the hierarchy kingdom, phylum, class, order, family, genus, species. Most OTUs cannot be classified down to species level. E.g.:
denovo745 k__Bacteria; p__Proteobacteria; c__Alphaproteobacteria; o__Rhizobiales; f__Rhizobiaceae; g__Agrobacterium; s__ 1.00 3
A representation of a taxonomic tree in Newick format. The tree can be visualized in applications like FigTree.
A file in BIOM (Biological Observation Matrix) format representing OTU tables. We will import this file into MEGAN 5 to visualize our results.

De novo OTU picking in detail (1)
Generate OTUs by clustering reads based on similarity (default is 97%).
Sort reads according to size (long -> short).
Cluster the sorted reads into OTUs (OTU1 to OTU5 in the figure).
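The sort-then-cluster idea can be sketched as a greedy loop in Python: reads are visited from longest to shortest, and each read either joins the first existing cluster whose seed it matches at the threshold or becomes the seed of a new cluster. This is a minimal sketch under stated assumptions, not the clustering tool QIIME calls: `identity` is an ungapped position-by-position comparison rather than a real alignment, and the toy reads and the 0.9 threshold are chosen only to keep the example short.

```python
def identity(a, b):
    """Fraction of matching positions over the shorter sequence (ungapped stand-in)."""
    n = min(len(a), len(b))
    return sum(x == y for x, y in zip(a, b)) / n if n else 0.0

def greedy_cluster(reads, threshold=0.97):
    """Return a list of (seed, members) clusters, visiting reads long -> short."""
    clusters = []
    for read in sorted(reads, key=len, reverse=True):
        for seed, members in clusters:
            if identity(seed, read) >= threshold:
                members.append(read)  # close enough to an existing seed
                break
        else:
            clusters.append((read, [read]))  # read becomes a new seed
    return clusters

# Toy, made-up reads.
toy_reads = ["ACGTACGTAA", "TTTTGGGGCC", "ACGTACGTACGT", "TTTTGGGGC"]
clusters = greedy_cluster(toy_reads, threshold=0.9)
for number, (seed, members) in enumerate(clusters, start=1):
    print(f"OTU{number}", seed, members)
```

Because each read is compared with seeds chosen earlier in the pass, the result depends on the order of the reads, which is one reason a pass like this is hard to run in parallel.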

De novo OTU picking in detail (2)
Pick a representative sequence for each OTU.
Assign taxonomy to each OTU using the reference database (in the figure, OTU1 to OTU5 receive lineage 1 to lineage 5).

De novo OTU picking in detail (3)
Align OTU sequences (if you want to do further phylogenetic analysis).
Optional: remove chimaeras from your alignment.
Filter alignment.
Create tree file in Newick format.
Create OTU table in BIOM format.
We can now visualize the results and do further analysis, such as alpha-diversity analysis (diversity within a sample) and beta-diversity analysis (diversity across samples).
We will first have a quick look at MEGAN 5, a tool we will use to visualize our results.

A quick look at MEGAN 5
MEGAN stands for MEtaGenome ANalyzer and was written to help understand the composition and operation of complex microbial consortia. It is free for academic users and can be downloaded from http://ab.inf.uni-tuebingen.de/software/megan5/.
In order to use MEGAN for both functional analysis and taxonomic analysis, a Blast step needs to be performed whereby a metagenomic dataset is Blast-ed against e.g. one of NCBI's non-redundant nucleotide or protein databases. This step is extremely computationally expensive and not an option for many users.
Recently support for the BIOM format was added, which allows us to visualize and analyze taxonomic analysis results from QIIME. Select import BIOM from the File menu.

Taxonomic tree display in MEGAN 5 (figure)

Rarefaction curves in MEGAN 5 (figure)

Taxonomic composition of samples in MEGAN 5 (figure)

Selecting 16S rDNA sequence with rRNASelector from shotgun data and closed-reference OTU picking with QIIME
Amplicon studies offer insight into taxonomic diversity of samples, but they cannot be used to study function (or coding potential). Instead we need shotgun data.
In an ideal world, to get the most out of your physical samples you'd prepare multiple libraries (amplicon, metagenomic, transcriptomic). In practice most people don't.
It is possible to get taxonomic information out of shotgun data. We'll discuss how we have approached this at the EBI.
rRNASelector (1): select reads with rDNA.
rRNASelector (2): remove non-rDNA sequence, leaving the rDNA sequence.

Closed-reference OTU picking
The set of clipped rDNA reads obtained with rRNASelector is clustered against a reference database.
(Figure: reads are matched with uclust against the 16S rDNA reference set; a read without a match is marked X.)

Further phylogenetic analyses: taxa summary
We can visualize the taxonomic composition of our samples. We will reproduce this figure during the hands-on session. We are looking at the composition at phylum level. A legend is also produced (not shown).

Further phylogenetic analyses: alpha diversity and rarefaction curves
Alpha diversity looks at the species diversity within samples.
If you produced more sequence from your sample, you would expect the number of species to increase until a point where producing more sequence does not significantly increase the number of observed species.
You can perform rarefaction analysis on your sample to find out whether you have sequenced at sufficient depth. Rarefaction analysis involves repeated in silico subsampling of your data at different intervals. For example, if your sample consists of 1000 sequences, you could randomly sample 100 reads (with e.g. 10 repetitions), then 200, 300, etc. You can then plot the subsample sizes against the number of observed species. If the curves flatten, then you have sequenced at sufficient depth.
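The subsampling procedure can be written down as a short Python sketch. It assumes each read has already been given an OTU label and uses OTUs as the stand-in for observed species; the community composition below is made up, and the function name is hypothetical rather than a QIIME call.

```python
import random

def rarefaction_curve(otu_labels, depths, repetitions=10, seed=0):
    """Mean number of distinct OTUs seen when subsampling reads at each depth."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    curve = []
    for depth in depths:
        observed = [len(set(rng.sample(otu_labels, depth)))
                    for _ in range(repetitions)]
        curve.append((depth, sum(observed) / repetitions))
    return curve

# Toy sample of 1000 reads, one OTU label per read (made-up abundances).
sample_reads = (["otu1"] * 500 + ["otu2"] * 300 + ["otu3"] * 150
                + ["otu4"] * 40 + ["otu5"] * 10)

curve = rarefaction_curve(sample_reads, depths=range(100, 1001, 100))
for depth, mean_observed in curve:
    print(depth, mean_observed)
```

Plotting depth against the mean number of observed OTUs gives the rarefaction curve; for this toy sample it reaches all five OTUs well before the full 1000 reads, which is the flattening described above.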

Divergence measurements between organisms
Divergence-based diversity measures estimate the degree to which pairs of organisms differ.
Sequence distance: measure of sequence identity.
Phylogenetic distance: sum of branch lengths that separate two organisms in a phylogenetic tree (see fig A).
Topological distance: as phylogenetic distance, but with all branch lengths set the same (usually 1).
Taxonomic distance: taxonomic level separating two organisms (e.g. same species = 1, same genus = 2, same family = 3, etc.).
Usually, where sequence data is available (e.g. 16S rRNA), sequence or phylogenetic distance measurements are most powerful.
If phylogenetic trees with meaningful branch lengths are not available, but taxonomic relationships are well defined, topological or taxonomic distance measures can be used (most commonly used for macroorganisms).
Figure caption: PD for grey is the sum of the grey branches.
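As a worked illustration of these tree-based measures, the Python sketch below computes the phylogenetic distance, the topological distance and the phylogenetic diversity (PD) on a tiny made-up tree. The tree encoding (each node mapped to its parent and branch length) and the function names are assumptions chosen for brevity; PD is taken here as the total length of all branches on the paths from the root to the chosen tips.

```python
def edges_to_root(tree, node):
    """Branches (keyed by child node) on the path from a node up to the root."""
    edges = {}
    while node in tree:
        parent, length = tree[node]
        edges[node] = length
        node = parent
    return edges

def phylogenetic_distance(tree, a, b, unit_lengths=False):
    """Sum of the branches separating a and b; unit_lengths=True gives topological distance."""
    edges_a, edges_b = edges_to_root(tree, a), edges_to_root(tree, b)
    separating = set(edges_a) ^ set(edges_b)  # branches on exactly one of the two paths
    lengths = {**edges_a, **edges_b}
    return sum(1.0 if unit_lengths else lengths[edge] for edge in separating)

def phylogenetic_diversity(tree, tips):
    """Total branch length covered by the root-to-tip paths of the given tips."""
    covered = {}
    for tip in tips:
        covered.update(edges_to_root(tree, tip))
    return sum(covered.values())

# Toy tree: root -> n1 (0.5) -> A (1.0), B (2.0); root -> C (3.0)
toy_tree = {"A": ("n1", 1.0), "B": ("n1", 2.0), "n1": ("root", 0.5), "C": ("root", 3.0)}

print(phylogenetic_distance(toy_tree, "A", "B"))                     # 1.0 + 2.0
print(phylogenetic_distance(toy_tree, "A", "C"))                     # 1.0 + 0.5 + 3.0
print(phylogenetic_distance(toy_tree, "A", "C", unit_lengths=True))  # three branches
print(phylogenetic_diversity(toy_tree, {"A", "B"}))                  # 1.0 + 2.0 + 0.5
```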

Measures of alpha diversity
A community that contains taxa that are more divergent from each other is more diverse.
There are many ways to measure alpha diversity. A few examples:
Phylogenetic Diversity (PD): measures the total sum of branch lengths in a phylogenetic tree that lead to each community member. Qualitative measure of divergence.
Theta: measures the average divergence between two randomly chosen sequences (individuals). Quantitative, as it accounts for both evenness and divergence between taxa. (Low evenness: numerical dominance of a few species.)
Chao 1: species-based qualitative measure.
Shannon: species-based quantitative measure.
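The two species-based measures can be computed directly from a vector of per-OTU read counts. The sketch below uses the standard Shannon index, H = -sum(p_i * log(p_i)), and the bias-corrected form of Chao 1, S_obs + F1(F1 - 1) / (2(F2 + 1)), where F1 and F2 are the numbers of singleton and doubleton OTUs. The choice of a base-2 logarithm and of the bias-corrected variant are assumptions of this sketch; check which conventions your own software uses. The counts are made up.

```python
import math

def shannon(counts, base=2):
    """Shannon index: accounts for both richness and evenness."""
    total = sum(counts)
    proportions = [c / total for c in counts if c > 0]
    return -sum(p * math.log(p, base) for p in proportions)

def chao1(counts):
    """Bias-corrected Chao 1 richness estimate from singletons and doubletons."""
    observed = sum(1 for c in counts if c > 0)
    singletons = sum(1 for c in counts if c == 1)
    doubletons = sum(1 for c in counts if c == 2)
    return observed + singletons * (singletons - 1) / (2 * (doubletons + 1))

even_counts = [10, 10, 10, 10]    # four equally abundant OTUs
skewed_counts = [37, 1, 1, 1]     # one dominant OTU, three singletons

print(shannon(even_counts), chao1(even_counts))
print(shannon(skewed_counts), chao1(skewed_counts))
```

Both toy communities contain four observed OTUs, but the skewed one has a lower Shannon value (low evenness) and a higher Chao 1 estimate (its singletons suggest unseen OTUs).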

Further phylogenetic analyses: beta diversity
Beta diversity analysis compares diversity between the samples in your study.
We calculate the distance between a pair of samples and we do this for all pairs of samples. We obtain a distance matrix that we can visualize in a number of ways, e.g. as a tree, a network or a principal coordinates (PCoA) plot.
During the hands-on we will generate PCoA plots to visualize the distances between our samples in 3-dimensional space. We'll have a separate tutorial on visualization with Emperor.
As our samples show variation in sequencing depth, we will use the number of reads from the smallest sample as our sequencing depth and rarefy all other samples to this depth.
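A minimal Python sketch of these two steps follows: every sample is rarefied to the depth of the smallest sample, and a pairwise distance matrix is then built. The Jaccard distance on OTU presence/absence is used here purely as a simple stand-in metric so that the example stays self-contained; it is not the measure used in the hands-on. The sample counts and function names are made up.

```python
import random
from collections import Counter
from itertools import combinations

def rarefy(counts, depth, rng):
    """Randomly subsample an OTU count table to exactly `depth` reads."""
    pool = [otu for otu, n in counts.items() for _ in range(n)]
    return Counter(rng.sample(pool, depth))

def jaccard_distance(a, b):
    """1 - shared OTUs / all OTUs (presence/absence only); a stand-in metric."""
    otus_a, otus_b = set(a), set(b)
    union = otus_a | otus_b
    return 1.0 - len(otus_a & otus_b) / len(union) if union else 0.0

def distance_matrix(samples, distance):
    """Symmetric matrix stored as a dict keyed by (sample, sample) pairs."""
    names = sorted(samples)
    matrix = {(name, name): 0.0 for name in names}
    for x, y in combinations(names, 2):
        matrix[(x, y)] = matrix[(y, x)] = distance(samples[x], samples[y])
    return matrix

# Toy, made-up OTU counts per sample.
samples = {
    "A1": {"otu1": 50, "otu2": 30, "otu3": 20},
    "B1": {"otu1": 10, "otu4": 40},
    "C1": {"otu2": 100, "otu3": 80, "otu5": 20},
}

rng = random.Random(0)
depth = min(sum(counts.values()) for counts in samples.values())  # smallest sample
rarefied = {name: rarefy(counts, depth, rng) for name, counts in samples.items()}
matrix = distance_matrix(rarefied, jaccard_distance)
print(depth)
print(matrix[("A1", "B1")], matrix[("A1", "C1")], matrix[("B1", "C1")])
```

The resulting matrix is the input that would then be drawn as a tree, a network or a PCoA plot.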

Measures of community distance: UniFrac
There are many ways to measure beta diversity (see e.g. Lozupone and Knight, 2009, for a summary).
Divergence-based measures: communities are considered more related if the taxa they contain are more closely related.
UniFrac (qualitative): measures phylogenetic distance between sets of taxa in a tree.
Weighted UniFrac (quantitative): variation of UniFrac that accounts for changes in relative abundance of lineages between communities.
Quantitative measures depend on accurate information on the relative abundance of sequences (which could be biased by lab procedures).
UniFrac allows you to:
Determine if the environments in the input phylogenetic tree have significantly different microbial communities.
Determine if community differences are concentrated within particular lineages of the phylogenetic tree.
Cluster environments to determine whether there are environmental factors (such as temperature or salinity) that group communities together.
Determine whether the environments were sampled sufficiently to support cluster nodes.
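The qualitative (unweighted) UniFrac idea can be sketched in Python as the fraction of branch length that leads to taxa from only one of the two communities. This is a simplified illustration on a tiny made-up tree, not the UniFrac software: the tree encoding (node mapped to parent and branch length) and the function names are assumptions, and abundances are ignored entirely.

```python
def covered_edges(tree, tips):
    """All branches (keyed by child node) on root-to-tip paths of a community."""
    edges = {}
    for tip in tips:
        node = tip
        while node in tree:
            parent, length = tree[node]
            edges[node] = length
            node = parent
    return edges

def unweighted_unifrac(tree, tips_a, tips_b):
    """Unique branch length / total branch length covered by either community."""
    edges_a = covered_edges(tree, tips_a)
    edges_b = covered_edges(tree, tips_b)
    all_edges = {**edges_a, **edges_b}
    total = sum(all_edges.values())
    unique = sum(length for edge, length in all_edges.items()
                 if (edge in edges_a) != (edge in edges_b))
    return unique / total if total else 0.0

# Toy tree: root -> n1 (0.5) -> A (1.0), B (2.0); root -> C (3.0)
toy_tree = {"A": ("n1", 1.0), "B": ("n1", 2.0), "n1": ("root", 0.5), "C": ("root", 3.0)}

print(unweighted_unifrac(toy_tree, {"A", "B"}, {"A", "B"}))  # identical communities
print(unweighted_unifrac(toy_tree, {"A", "B"}, {"B", "C"}))  # partly shared
print(unweighted_unifrac(toy_tree, {"A"}, {"C"}))            # nothing shared
```

Communities that share most of their branch length score close to 0, and communities with no shared branches score 1.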

QIIME analysis of Illumina amplicon data
Data preparation differs from 454 analysis.
Closed-reference OTU picking can be parallelized and is therefore preferred.
For demultiplexing you need a mapping file (as discussed for 454), the fastq file containing the barcode sequence and the fastq file containing the reads. It is also possible to demultiplex samples if your data is from multiple lanes.
For details see the following QIIME tutorial: http://qiime.org/tutorials/processing_illumina_data.html
Note: for a full HiSeq2000 run, this process can take up to 500 CPU hours!

Finally
This concludes the introduction to taxonomic analysis with QIIME. If taxonomic analysis is important to your work, then do spend time going through the different QIIME tutorials at http://qiime.org/tutorials/.
Thank you