Introduction to OTU Clustering. Susan Huse August 4, 2016

Similar documents
Sequencing Errors, Diversity Estimates, and the Rare Biosphere

Quality Filtering of Illumina Sequences. Susan Huse Brown University August 6, 2015

Robert Edgar. Independent scientist

mothur tutorial STAMPS, 2013 Kevin R. Theis Department of Zoology BEACON Center for the Study of Evolution in Action Michigan State University

mothur Workshop for Amplicon Analysis Michigan State University, 2013

Introduction to Bioinformatics analysis of Metabarcoding data

Introduction to taxonomic analysis of metagenomic amplicon and shotgun data with QIIME. Peter Sterk EBI Metagenomics Course 2014

Carl Woese. Used 16S rrna to develop a method to Identify any bacterium, and discovered a novel domain of life

Metagenomics Computational Genomics

Carl Woese. Used 16S rrna to developed a method to Identify any bacterium, and discovered a novel domain of life

Assessing and Improving Methods Used in Operational Taxonomic Unit-Based Approaches for 16S rrna Gene Sequence Analysis

Microbiomes and metabolomes

Lecture 01: Overview of Metagenomics


Stability of operational taxonomic units: an important but neglected property for analyzing microbial diversity

Methods for the phylogenetic inference from whole genome sequences and their use in Prokaryote taxonomy. M. Göker, A.F. Auch, H. P.

TECHNIQUES FOR STUDYING METAGENOME DATASETS METAGENOMES TO SYSTEMS.

Bellerophon; a program to detect chimeric sequences in multiple sequence

Bioinformatics for Microbial Biology

Nature Biotechnology: doi: /nbt Supplementary Figure 1. MBQC base beta diversity, major protocol variables, and taxonomic profiles.

dbcamplicons pipeline Amplicons

HmmUFOtu: An HMM and phylogenetic placement based ultra-fast taxonomic assignment and OTU picking tool for microbiome amplicon sequencing studies

dbcamplicons pipeline Amplicons

Fungal ITS Bioinformatics Efforts in Alaska

COMPARING MICROBIAL COMMUNITY RESULTS FROM DIFFERENT SEQUENCING TECHNOLOGIES

SUPPLEMENTARY INFORMATION

All subjects gave written informed consent before participating in this study, which

SANBio BIOINFORMATICS TRAINING COURSE THE MICROBIOME: ANALYSIS OF NGS DATA CBIO-PIPELINE SAMSON, KM

Report on database pre-processing

Genome Jigsaw: Implications of 16S Ribosomal RNA Gene Fragment Position for Bacterial Species Identification

Bioinformatic tools for metagenomic data analysis

Advisors: Prof. Louis T. Oliphant Computer Science Department, Hiram College.

1 Abstract. 2 Introduction. 3 Requirements. Most Wanted Taxa from the Human Microbiome The Broad Institute

Practical Bioinformatics for Life Scientists. Week 14, Lecture 27. István Albert Bioinformatics Consulting Center Penn State

What is metagenomics?

CBC Data Therapy. Metagenomics Discussion

Development of NGS metabarcoding. characterization of aerobiological samples. Lucia Muggia

Metagenomic 3C, full length 16S amplicon sequencing on Illumina, and the diabetic skin microbiome

BIOINFORMATICS TO ANALYZE AND COMPARE GENOMES

Evaluation of the liver abscess microbiome and liver abscess prevalence in cattle reared for production of natural branded beef

Phylogenetic methods for taxonomic profiling

Microbiome: Metagenomics 4/4/2018

HMP Data Set Documentation

Introduction to metagenome assembly. Bas E. Dutilh Metagenomic Methods for Microbial Ecologists, NIOO September 18 th 2014

BIOINFORMATICS ORIGINAL PAPER

Parts of a standard FastQC report

Oligotyping analysis of the human oral microbiome

USEARCH software and documentation Copyright Robert C. Edgar All rights reserved.

Microbiomics I August 24th, Introduction. Robert Kraaij, PhD Erasmus MC, Internal Medicine

GPU-Meta-Storms: Computing the similarities among massive microbial communities using GPU

Evaluation of the liver abscess microbiome and liver abscess prevalence in cattle reared for production of natural branded beef

Introduction to CGE tools

Chapter 7. Motif finding (week 11) Chapter 8. Sequence binning (week 11)

ChIP. November 21, 2017

12/8/09 Comp 590/Comp Fall

Gene Expression Data Analysis

Distribution-Based Clustering: Using Ecology To Refine the Operational Taxonomic Unit

Reducing the Effects of PCR Amplification and Sequencing Artifacts on 16S rrna-based Studies

a-dB. Code assigned:

Exploring Microbial Diversity and Taxonomy Using SSU rrna Hypervariable Tag Sequencing

UTAX accurately predicts taxonomy of marker gene sequences

Dr. Sabri M. Naser Department of Biology and Biotechnology An-Najah National University Nablus, Palestine

Computational Systems Biology Deep Learning in the Life Sciences

Nature Biotechnology: doi: /nbt.3943

Supplementary Information for

NB! 6. Global key annotations

Contact us for more information and a quotation

Introducing DOTUR, a Computer Program for Defining Operational Taxonomic Units and Estimating Species Richness

Machine Learning. HMM applications in computational biology

Just the Facts: A Basic Introduction to the Science Underlying NCBI Resources

Genomic resources. for non-model systems

Lecture 7. Next-generation sequencing technologies

Conducting Microbiome study, a How to guide

Sequence Analysis. II: Sequence Patterns and Matrices. George Bell, Ph.D. WIBR Bioinformatics and Research Computing

Using Ensembles of Hidden Markov Models in metagenomic analysis

Applications of Next Generation Sequencing in Metagenomics Studies

a-dB. Code assigned:

ALGORITHMS IN BIO INFORMATICS. Chapman & Hall/CRC Mathematical and Computational Biology Series A PRACTICAL INTRODUCTION. CRC Press WING-KIN SUNG

NGS part 2: applications. Tobias Österlund

Figure S4 A-H : Initiation site properties and evolutionary changes

Sequence assembly. Jose Blanca COMAV institute bioinf.comav.upv.es

Accurate determination of microbial diversity from 454 pyrosequencing data

From Variants to Pathways: Agilent GeneSpring GX s Variant Analysis Workflow

Novel bacterial taxa in the human microbiome

Utilizing novel diversity estimators to quantify multiple dimensions of microbial biodiversity across domains

A FRAMEWORK FOR ANALYSIS OF METAGENOMIC SEQUENCING DATA

A comparison of sequencing platforms and bioinformatics pipelines for compositional analysis of the gut microbiome

Introduction to Microbial Community Analysis. Tommi Vatanen CS-E Statistical Genetics and Personalised Medicine

Evaluating the accuracy of amplicon-based microbiome computational pipelines on simulated human gut microbial communities

Genome sequence of Brucella abortus vaccine strain S19 compared to virulent strains yields candidate virulence genes

on November 21, 2018 by guest

CS 262 Lecture 14 Notes Human Genome Diversity, Coalescence and Haplotypes

Spectral Counting Approaches and PEAKS

Tracking the units of microbial communities with metagenomics

Supplementary Material

HHS Public Access Author manuscript Nat Methods. Author manuscript; available in PMC 2016 November 23.

ngs metagenomics target variation amplicon bioinformatics diagnostics dna trio indel high-throughput gene structural variation ChIP-seq mendelian

A step towards understanding Polycyclic Aromatic Hydrocarbons degradation by microbial communities

Association Mapping in Plants PLSC 731 Plant Molecular Genetics Phil McClean April, 2010

Runs of Homozygosity Analysis Tutorial

Transcription:

Introduction to OTU Clustering Susan Huse August 4, 2016

What is an OTU? Operational Taxonomic Units a.k.a. phylotypes a.k.a. clusters aggregations of reads based only on sequence similarity, independent of taxonomic name

The Magical 3% NOT! 3% SSU OTUs = Species and 6% SSU OTUs = Genera

History of The 3% OTUs Wayne et al (1987) IJSEM 37(4): 463-464 Uses genome-level phylogeny to define species, using 70% DNA-DNA reassociation. Stackebrandt and Goebel (1994) IJSEM 44(4): 846-849 70% DNA-DNA ~ 97% similarity of full-length 16S Therefore 97% similarity of any hypervariable region in any bacteria or fungus = species?

Cumulative Percentage of Non-Singleton Taxa 18% of genera are > 10% OTUs 30% of genera are < 3% OTUs Genus (827) Family (244) Order (92) Class (52) Phylum (31) Fig 1A Schloss & Westcott (2011) Appl Env Microbiol Maximum Intra-Taxon Distance for Full-length 16S

Why OTUs? Because we can! We can t: do whole genome sequencing of communities assign complete taxonomic names derive phylogenetic trees from millions of short tags Continuing to develop new methods that are more biologically or evolutionarily meaningful Come back in 10 years

Pros and Cons OTUs Novel organisms Taxonomy Universal names Insufficient taxonomy Does not lump together all order or family-level classifications Many names based on phenotype rather than genotype Meaning associated with names Independent of clustering width and algorithm Historically well-studied are split New areas are lumped

Cluster Width Diameter Sequences are never more than D apart. (CL) Radius Sequences are never more than R from seed or reference. (Ref, Greedy) Variable Only selected positions (Oligo) Averaged radii (AL)

Primary Aggregation Methods Complete linkage - no two sequences are farther apart than the clustering width Average linkage - the average distance between sequences is less than the width Greedy clusters sequences are less than the width to OTU representatives built de novo Reference sequences are less than the width to universal reference set of OTU representatives Oligotyping sequences are equivalent at most informationrich nucleotide positions Probabilistic error modeling of likely clusters

De Novo Greedy Clustering Most abundant reads: are likely true sequences least likely to be errors or chimeras, and most likely to spawn errors and chimeras Seed OTUs with most abundant, Assign variation to most abundant source.

De Novo Greedy Clustering Sort by abundance First Seq 1 is seed of Cluster 1 Compare next Seq to each cluster in order: if Distance(Seq, Seed) <= W, include it in the cluster if Seq is not a member of any cluster, seed a new cluster Edgar (2010) Bioinformatics 26(19):2460-1

Greedy Seeds > W Each tag goes to OTU with nearest representative, not OTU with the nearest sequence

5. Bin each tag to that reference s OTU#, or unclustered if > Width Rideout et al (2014) PeerJ in press Closed Reference OTUs 1. Create a set full-length reference sequences (e.g., Greengenes) 2. Cluster reference sequences 3. Select representative sequence for each cluster 4. Map each tag to nearest rep <= Width

Open Reference OTUs 1. Perform Closed Reference OTU mapping 2. Perform de novo clustering on unclustered sequences 3. Subsequent datasets can be mapped to the reference and to an expanding set of de novo clusters.

Reference OTU Caveat X% clustering of full-length reference sequences does not represent X% for each hypervariable region. Will vary by phylotype

Oligotyping Calculated Entropy Alignment positions of greatest information Oligotypes identical in selected positions Ignore the rest as noise Alignment Position Eren et al (2013) Methods in Ecol and Evol 4:1111-9

Pros and Cons Greedy, AL - standard de novo methods, abundance dependent, project-specific Reference OTU compare V-regions, universal IDs, dependent on reference database and reference clustering Oligotyping / DADA2 increased sensitivity, no specific radius, projectspecific

The Problem of Inflation Clustering algorithms return more OTUs than predicted for mock communities. OTU inflation leads to: alpha diversity inflation beta diversity inflation Where does this inflation come from? How can we adjust for this inflation?

OTU inflation varied with sample depth 7000 MS-CL M2FN Rarefaction - PML OTUs 6000 5000 4000 3000 2000 1000 0-20,000 40,000 60,000 80,000 100,000 120,000 Number of Sequences Sampled 5K 10K 15K 20K 50K 100K

Low-Quality Reads and Errant OTUs Word to the wise: minimize adventurous OTUs

Different clustering algorithms have very different effects on the size and number of OTUs created

Inflation in Action >200,000 V6 reads of E. coli clone (454) 1,042 is a few more than the expected 2 Huse et al (2010) Environ Microbiol 12(7): 1889-98

Sequencing Noise in E.coli Distance Tags Seqs 0 177,697 2 82.4% 0.02 33,425 633 97.9% 0.03 3738 2268 99.6% 0.04 2 1 99.6% 0.05 635 306 99.9% 0.06 38 25 99.9% 0.07 89 72 100.0% 0.08 22 17 100.0% 0.09 0 0 100.0% 0.10 4 4 100.0% > 0.10 23 16 100.0% Cluster at 3% using only tags within 3% of the correct sequence

Inflation with small sequence error Only 129 OTUs much better!!

Clustering V6 sequences within 3% of known templates (outliers removed) E. coli (2) S. epidermidis (1) Clone43 (43) MS-CL 129 / 89 / 694 Clustering method MS-AL 54 / 44 / 218 Alignment method PW-CL 6 / 5 / 308 PW-AL 2 / 1 / 43 MS Multiple Sequence Alignment PW Pairwise Alignment CL Complete Linkage clustering AL Average Linkage clustering

Multiple Sequence Alignments Regardless of clustering algorithm, an MSA cannot fully align tags whose sequences are too divergent, more tags cause more problems 18,156 sequences and 392 positions

MSA Limitations Distance Comparison Independent Distances Subsampled Distances X-axis: 1000 distances subsampled from MSA of 40,000 sequences Y-axis: 1000 distances from MSA of only the 1,000 bacterial sequences

Cluster Count: 1234567 Complete Linkage propagates OTUs #5 #6 #9 #8 #4 #3 #1 #12 #11 #2 #7 #10 #13 Each V6 variant with 2nt distance, will start a new OTU. Each V6 variant with 1nt distance, can cluster with the seed or a 2nd variant.

Average Linkage collapses errors Cluster Count: 1 #1 Clusters tend to be heavily dominated by their most abundant sequence, which strongly weights the average and smoothes the noise.

Greedy Seeds (UClust) Cluster Count: 1

Effect of Sampling Depth OTUs 7000 6000 5000 4000 3000 2000 1000 MS-CL M2FN Rarefaction - PML 5K 10K 15K 20K 50K 100K Previous algorithms differentially inflated OTUs at a sampling depth dependent upon total sample depth 0-20,000 40,000 60,000 80,000 100,000 120,000 Number of Sequences Sampled Newer methods do not

Errant OTUs If a low-quality sequence is >3% from its source, it can create a new OTU. If the rate of an errant read = 1 in 10 thousand, and we have 1 million reads: 0.0001 * 1,000,000 reads = 100 errant OTUs

Impact of Errant OTUs Errant OTUs as percent of OTUs decreases with diversity. If we have 100 errant OTUs: Mock community: 100 / 50 OTUs = +200% Diverse community: 100 / 2,000 OTUs = +5%

Relative Inflation Absolute number of errant OTUs will increase with sample size. Relative number of errant OTUs will decrease with sample complexity

What is the right cluster width? No.

What is the right cluster width? Distances vary by hypervariable region Full-length does not equal V1-V3 which does not equal V3-V5 which does not equal V6, etc. Distance vary by bacterial lineage What specificity are you trying to gain? No single clustering width is the right answer, use several (and wave your hands a lot) or use other methods like Oligotyping

Questions about Clustering How meaningful are clusters functionally? When is an errare rare and when is it an error? Should it be included in an existing cluster or start its own? How to assign sequences if OTUs overlap? What is the effect of residual low quality data or chimeras? How sensitive are alpha and beta diversity estimates to clustering results?