Strategies and Techniques for Analyzing Microbial Population Structures: Introduction to QIIME on the IPython Notebook. Rob Knight, Adam Robbins-Pianka, Will Van Treuren, Yoshiki Vázquez-Baeza (@yosmark), Luke Ursell
A microbe dominated world The universal nature of biochemistry. Pace NR. Proc Natl Acad Sci U S A. 2001 Jan 30;98(3):805-8.
Vast microbial diversity in every ecosystem, including our own. Question: how human are we? Human: 10 trillion human cells, 20,000 human genes. Microbiota: 100 trillion microbial cells, 2-20 million microbial genes. 99.9% of our genomes the same, but our microbes...?
How do we assay this diversity?
QIIME workflow overview (www.qiime.org)
Inputs: sequencing output (454, Illumina, Sanger) as fastq, fasta, qual, or sff/trace files; metadata mapping file
Pre-processing: e.g., remove primer(s), demultiplex, quality filter
Denoise 454 data: PyroNoise, Denoiser
Database submission (in development)
Pick OTUs and representative sequences: reference based (BLAST, UCLUST, USEARCH) or de novo (e.g., UCLUST, CD-HIT, MOTHUR, USEARCH)
Assign taxonomy: BLAST, RDP Classifier
Align sequences: e.g., PyNAST, INFERNAL, MUSCLE, MAFFT
Build phylogenetic tree (evolutionary relationship between OTUs): e.g., FastTree, RAxML, ClearCut
Build 'OTU table', i.e., sample by observation matrix
α-diversity and rarefaction: e.g., Phylogenetic Diversity, Chao1, Observed Species
β-diversity and rarefaction: e.g., weighted and unweighted UniFrac, Bray-Curtis, Jaccard
Interactive visualizations: e.g., PCoA plots, distance histograms, taxonomy charts, rarefaction plots, network visualization, jackknifed hierarchical clustering
Legend: currently supported for marker-gene data only (i.e., 'upstream' step); currently supported for general sample by observation data (i.e., 'downstream' step); required step or input; optional step or input
Samples to sequences Sequencing output (454, Illumina, Sanger) fastq, fasta, qual, or sff/trace files Metadata mapping file Pre-processing e.g., remove primer(s), demultiplex, quality filter Denoise 454 Data PyroNoise, Denoiser Database Submission (In development)
Error-correcting codes allow multiplex sequencing
>GCACCTGAGGACAGGCATGAGGAA >GCACCTGAGGACAGGGGAGGAGGA >TCACATGAACCTAGGCAGGACGAA >CTACCGGAGGACAGGCATGAGGAT >TCACATGAACCTAGGCAGGAGGAA >GCACCTGAGGACACGCAGGACGAC >CTACCGGAGGACAGGCAGGAGGAA >CTACCGGAGGACACACAGGAGGAA >GAACCTTCACATAGGCAGGAGGAT >TCACATGAACCTAGGGGCAAGGAA >GCACCTGAGGACAGGCAGGAGGAA
>PC.634_1 FLP3FBN01ELBSX
CTGGGCCGTGTCTCAGTCCCAATGTGGCCGTTTACCCTCTCAGGCCGGCTAC
GCATCATCGCCTTGGTGGGCCGTTACCTCACCAACTAGCTAATGCGCCGCAG
GTCCATCCATGTTCACGCCTTGATGGGCGCTTTAATATACTGAGCATGCGCT
CTGTATACCTATCCGGTTTTAGCTACCGTTTCCAGCAGTTATCCCGGACACA
TGGGCTAGG
>PC.354_3 FLP3FBN01EEWKD
TTGGACCGTGTCTCAGTTCCAATGTGGGGGCCTTCCTCTCAGAACCCCTATC
CATCGAAGGCTTGGTGGGCCGTTACCCCGCCAACAACCTAATGGAACGCATC
CCCATCGATGACCGAAGTTCTTTAATAGTTCTACCATGCGGAAGAACTATGC
CATCGGGTATTAATCTTTCTTTCGAAAGGCTATCCCCGAGTCATCGGCAGGT
TGGATACGTGTTACTCACCCGTGCGCCGGT
Micah Hamady, et al., Nature Methods, 2008. Error-correcting barcodes for pyrosequencing hundreds of samples in multiplex.
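The idea of assigning a barcoded read to a sample while tolerating a sequencing error can be sketched in a few lines of Python. This is a simplified nearest-barcode lookup by Hamming distance, not the decoding scheme from Hamady et al. and not QIIME's demultiplexing code; the barcodes, the `max_errors` parameter, and the function names are hypothetical.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def assign_sample(read, barcode_to_sample, max_errors=1):
    """Return (sample_id, errors) for the read's barcode prefix, or (None, None)."""
    barcode_len = len(next(iter(barcode_to_sample)))
    observed = read[:barcode_len]
    # Rank the known barcodes by their distance to the observed prefix.
    ranked = sorted((hamming(observed, bc), bc) for bc in barcode_to_sample)
    best_dist, best_bc = ranked[0]
    # Reject reads with too many errors or with an ambiguous (tied) best match.
    tied = len(ranked) > 1 and ranked[1][0] == best_dist
    if best_dist > max_errors or tied:
        return None, None
    return barcode_to_sample[best_bc], best_dist


# Hypothetical 8-nt barcodes; every pair differs at 6 or more positions,
# so a single substitution can still be attributed to one sample.
BARCODES = {"AACCGGTT": "PC.634", "TTGGCCAA": "PC.354", "ACGTACGT": "Sample.3"}

print(assign_sample("AACCGGTACTGGGCC", BARCODES))  # one error in the barcode
print(assign_sample("GGGGGGGGCTGGGCC", BARCODES))  # no barcode is close enough
```

The further apart the barcodes are from each other, the more errors can be corrected without confusing two samples.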
Sequences to OTUs and Phylogeny. Pick OTUs and representative sequences: reference based (BLAST, UCLUST, USEARCH) or de novo (e.g., UCLUST, CD-HIT, MOTHUR, USEARCH). Assign taxonomy: BLAST, RDP Classifier. Align sequences: e.g., PyNAST, INFERNAL, MUSCLE, MAFFT. Build 'OTU table', i.e., sample by observation matrix. Build phylogenetic tree: e.g., FastTree, RAxML, ClearCut
OTU Picking: de novo. Experimental Sequences: TTGGAAGATGTCTCAGTTCCAG TTGGAAGATGTCTCAGTTCCAG TTGGAAGATGTCTCAGTTCCAG TTGGAAGATGTCTCAGTTCCAG TTGGGCCGTATGTCAGTCCCTA TTGGAAGATGTCTCAGTTCCAG TTGGGCCGTATGTCAGTCCCTA. Clustering Algorithm. Clustered Sequences. OTUs: OTU1 OTU2 OTU3
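A minimal sketch of de novo OTU picking, assuming a greedy centroid strategy and a naive position-by-position identity measure. Real tools such as UCLUST use proper alignment and heuristics; the function names and the 97% threshold here are illustrative assumptions.

```python
def identity(a, b):
    """Fraction of matching positions (no alignment; a toy similarity measure)."""
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))


def pick_otus_de_novo(seqs, threshold=0.97):
    """Greedy centroid clustering: returns {otu_id: [indices of member sequences]}."""
    centroids = []  # one representative sequence per OTU, in order of creation
    otus = {}
    for i, seq in enumerate(seqs):
        for otu_idx, centroid in enumerate(centroids):
            if identity(seq, centroid) >= threshold:
                otus[f"OTU{otu_idx + 1}"].append(i)
                break
        else:
            # No existing centroid is similar enough: this read seeds a new OTU.
            centroids.append(seq)
            otus[f"OTU{len(centroids)}"] = [i]
    return otus


reads = [
    "TTGGAAGATGTCTCAGTTCCAG",
    "TTGGAAGATGTCTCAGTTCCAG",
    "TTGGGCCGTATGTCAGTCCCTA",
    "TTGGAAGATGTCTCAGTTCCAG",
    "TTGGGCCGTATGTCAGTCCCTA",
]
print(pick_otus_de_novo(reads))
```

No reference database is needed, but the resulting OTU identifiers depend on the input and its order, so they are not comparable across separately clustered studies.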
OTU Picking: Closed Reference. Reference Sequences: TTGGAAGATGTCTCAGTTCCAG TTGGGCCGTATGTCAGTCCCTA TTGGAAGATGTCTCAGTTCCAG TTGGGCCGTATGTCAGTCCCTA. Sequences that hit a reference: TTGGAAGATGTCTCAGTTCCAG TTGGAAGATGTCTCAGTTCCAG TTGGAAGATGTCTCAGTTCCAG. Sequences that failed to hit: TTGGAAGATGTCTCAGTTCCAG TTGGGCCGTATGTCAGTCCCTA TTGGAAGATGTCTCAGTTCCAG TTGGGCCGTATGTCAGTCCCTA. Experimental Sequences. OTUs: OTU1 OTU1 OTU1
OTU Picking: Open Reference. Reference Sequences: TTGGAAGATGTCTCAGTTCCAG TTGGGCCGTATGTCAGTCCCTA TTGGAAGATGTCTCAGTTCCAG TTGGGCCGTATGTCAGTCCCTA. Sequences that hit a reference: TTGGAAGATGTCTCAGTTCCAG TTGGAAGATGTCTCAGTTCCAG TTGGAAGATGTCTCAGTTCCAG. Sequences that failed to hit: TTGGAAGATGTCTCAGTTCCAG TTGGGCCGTATGTCAGTCCCTA TTGGAAGATGTCTCAGTTCCAG TTGGGCCGTATGTCAGTCCCTA. Experimental Sequences. Clustering Algorithm. OTUs: OTU1 OTU2 OTU3 OTU4 OTU5 OTU6
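The difference between closed-reference and open-reference picking can be sketched as follows. This is a toy illustration under the same assumptions as any naive identity-based example (no alignment, a hypothetical 97% threshold, hypothetical function and OTU names); it is not how BLAST, UCLUST, or USEARCH work internally.

```python
def identity(a, b):
    """Fraction of matching positions (no alignment; a toy similarity measure)."""
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))


def pick_otus_closed_reference(seqs, reference, threshold=0.97):
    """Return (hits, failures); hits maps reference id -> indices of matching reads."""
    hits, failures = {}, []
    for i, seq in enumerate(seqs):
        best_id, best_score = None, 0.0
        for ref_id, ref_seq in reference.items():
            score = identity(seq, ref_seq)
            if score > best_score:
                best_id, best_score = ref_id, score
        if best_id is not None and best_score >= threshold:
            hits.setdefault(best_id, []).append(i)
        else:
            failures.append(i)  # closed reference: these reads are discarded
    return hits, failures


def pick_otus_open_reference(seqs, reference, threshold=0.97):
    """Closed-reference step first, then de novo clustering of the failures."""
    hits, failures = pick_otus_closed_reference(seqs, reference, threshold)
    otus = dict(hits)
    new_centroids = []  # (otu_id, centroid sequence) for OTUs created de novo
    for i in failures:
        for otu_id, centroid in new_centroids:
            if identity(seqs[i], centroid) >= threshold:
                otus[otu_id].append(i)
                break
        else:
            otu_id = f"New.OTU{len(new_centroids) + 1}"
            new_centroids.append((otu_id, seqs[i]))
            otus[otu_id] = [i]
    return otus


reference = {"Ref1": "TTGGAAGATGTCTCAGTTCCAG"}
reads = [
    "TTGGAAGATGTCTCAGTTCCAG",
    "TTGGGCCGTATGTCAGTCCCTA",
    "TTGGAAGATGTCTCAGTTCCAG",
    "TTGGGCCGTATGTCAGTCCCTA",
]
print(pick_otus_closed_reference(reads, reference))
print(pick_otus_open_reference(reads, reference))
```

Closed reference keeps only reads that match the database; open reference additionally clusters the reads that failed to hit, so novel diversity is retained.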
Computing alpha and beta diversity. Inputs: OTU (or other sample by observation) table; phylogenetic tree (evolutionary relationship between OTUs). α-diversity and rarefaction: e.g., Phylogenetic Diversity, Chao1, Observed Species. β-diversity and rarefaction: e.g., weighted and unweighted UniFrac, Bray-Curtis, Jaccard
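For the non-phylogenetic metrics named above, a small sketch on per-sample count vectors may help. It uses the bias-corrected form of Chao1 and the presence/absence form of Jaccard; the function names and the example counts are hypothetical, and this is not QIIME's implementation.

```python
import random


def observed_species(counts):
    """Number of OTUs with at least one observation in the sample."""
    return sum(1 for c in counts if c > 0)


def chao1(counts):
    """Bias-corrected Chao1: S_obs + F1 * (F1 - 1) / (2 * (F2 + 1))."""
    f1 = sum(1 for c in counts if c == 1)  # singletons
    f2 = sum(1 for c in counts if c == 2)  # doubletons
    return observed_species(counts) + f1 * (f1 - 1) / (2 * (f2 + 1))


def bray_curtis(u, v):
    """Bray-Curtis dissimilarity between two count vectors over the same OTUs."""
    return sum(abs(a - b) for a, b in zip(u, v)) / sum(a + b for a, b in zip(u, v))


def jaccard(u, v):
    """Jaccard distance on presence/absence: 1 - shared OTUs / OTUs seen in either."""
    shared = sum(1 for a, b in zip(u, v) if a > 0 and b > 0)
    union = sum(1 for a, b in zip(u, v) if a > 0 or b > 0)
    return 1 - shared / union


def rarefy(counts, depth, seed=0):
    """Subsample a sample without replacement to an even sequencing depth."""
    pool = [i for i, c in enumerate(counts) for _ in range(c)]
    out = [0] * len(counts)
    for i in random.Random(seed).sample(pool, depth):
        out[i] += 1
    return out


sample_a = [10, 1, 1, 2, 0]  # hypothetical counts for five OTUs
sample_b = [5, 0, 1, 0, 4]
print(observed_species(sample_a), chao1(sample_a))
print(bray_curtis(sample_a, sample_b), jaccard(sample_a, sample_b))
print(rarefy(sample_a, depth=10))
```

Rarefying both samples to the same depth before computing a metric removes differences that are due only to sequencing effort.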
Comparing microbial communities. Who's there? How many are there? α (i.e., within sample) diversity. How similar are any two samples? Treatments? β (i.e., between sample) diversity
Phylogenetic Diversity (PD): a qualitative, phylogenetic α-diversity metric. Sum of branch length covered by a sample. Faith DP (1992) Conservation evaluation and phylogenetic diversity. Biological Conservation. 61:1-10.
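"Sum of branch length covered by a sample" can be made concrete on a toy tree. The sketch below assumes a rooted tree stored as a dictionary mapping each node to its parent and the length of the branch above it, and counts every branch on the path from an observed tip to the root; the tree, its branch lengths, and the function names are hypothetical.

```python
# Hypothetical toy tree: node -> (parent, length of the branch above the node).
TREE = {
    "root": (None, 0.0),
    "n1": ("root", 0.5),
    "n2": ("root", 0.25),
    "OTU1": ("n1", 1.0),
    "OTU2": ("n1", 1.0),
    "OTU3": ("n2", 2.0),
    "OTU4": ("n2", 0.25),
}


def covered_branches(tree, observed_tips):
    """Nodes whose parent branch lies on a path from an observed tip to the root."""
    covered = set()
    for tip in observed_tips:
        node = tip
        # Walk towards the root, stopping early once we join a path already seen.
        while node is not None and node not in covered:
            covered.add(node)
            node = tree[node][0]
    return covered


def faith_pd(tree, observed_tips):
    """Phylogenetic Diversity: total branch length covered by the observed tips."""
    return sum(tree[node][1] for node in covered_branches(tree, observed_tips))


print(faith_pd(TREE, {"OTU1", "OTU2"}))  # two close relatives
print(faith_pd(TREE, {"OTU1", "OTU3"}))  # two distant relatives
```

Both samples contain two OTUs, but the second covers more of the tree, which is what makes PD a phylogenetic rather than a purely count-based metric.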
Unweighted UniFrac: a qualitative, phylogenetic β-diversity metric. Identical communities: D = 0.0. Related communities: D ~ 0.5. Unrelated communities: D = 1.0. Percent of observed branch length that is unique to either sample. Lozupone and Knight, 2005, Appl Environ Microbiol 71:8228
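The definition above (observed branch length unique to either sample, divided by all observed branch length) can be sketched on a toy rooted tree. The tree representation, branch lengths, and function names are hypothetical, and this is an illustration of the definition rather than the algorithm used by QIIME.

```python
# Hypothetical toy tree: node -> (parent, length of the branch above the node).
TREE = {
    "root": (None, 0.0),
    "n1": ("root", 0.5),
    "n2": ("root", 0.25),
    "OTU1": ("n1", 1.0),
    "OTU2": ("n1", 1.0),
    "OTU3": ("n2", 2.0),
    "OTU4": ("n2", 0.25),
}


def covered_branches(tree, observed_tips):
    """Nodes whose parent branch lies on a path from an observed tip to the root."""
    covered = set()
    for tip in observed_tips:
        node = tip
        while node is not None and node not in covered:
            covered.add(node)
            node = tree[node][0]
    return covered


def unweighted_unifrac(tree, tips_a, tips_b):
    """Unique branch length / total observed branch length for two samples."""
    a = covered_branches(tree, tips_a)
    b = covered_branches(tree, tips_b)
    total = sum(tree[n][1] for n in a | b)   # observed in either sample
    unique = sum(tree[n][1] for n in a ^ b)  # observed in exactly one sample
    return unique / total if total else 0.0


print(unweighted_unifrac(TREE, {"OTU1", "OTU2"}, {"OTU1", "OTU2"}))  # identical
print(unweighted_unifrac(TREE, {"OTU1", "OTU3"}, {"OTU2", "OTU3"}))  # related
print(unweighted_unifrac(TREE, {"OTU1", "OTU2"}, {"OTU3", "OTU4"}))  # unrelated
```

Only presence or absence of each OTU matters here, which is why the metric is called qualitative; abundances are ignored.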
Clustering by UniFrac distance
Extract DNA and amplify marker gene with barcoded primers Pool amplicons and sequence www.qiime.org >GCACCTGAGGACAGGCATGAGGAA >GCACCTGAGGACAGGGGAGGAGGA >TCACATGAACCTAGGCAGGACGAA >CTACCGGAGGACAGGCATGAGGAT >TCACATGAACCTAGGCAGGAGGAA >GCACCTGAGGACACGCAGGACGAC >CTACCGGAGGACAGGCAGGAGGAA >CTACCGGAGGACACACAGGAGGAA >GAACCTTCACATAGGCAGGAGGAT >TCACATGAACCTAGGGGCAAGGAA >GCACCTGAGGACAGGCAGGAGGAA Assign reads to samples RefSeq 1 RefSeq 2 RefSeq 3 RefSeq 4 RefSeq 5 RefSeq 6 RefSeq 7 RefSeq 8 RefSeq 9 RefSeq 10 Assign millions of sequences from thousands of samples to OTUs Compute UniFrac distances and compare samples
Key QIIME files. Mapping file: per sample metadata, user-defined. OTU table: sample x OTU matrix, central to downstream analyses [now in biom format]. Parameters file: defines analyses, for use with the workflow scripts (optional)
Parameters Can Be Set In a Few Ways. qiime_config files: environment variable $QIIME_CONFIG_FP, user's home directory. Parameter files. Command line
Mapping file
Mapping file: always run check_id_map.py! = required field
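To give a feel for the kind of problems a mapping file check looks for, here is a small sketch. It is not the implementation of check_id_map.py; it assumes the QIIME 1 convention of a tab-separated file whose header starts with #SampleID, BarcodeSequence, LinkerPrimerSequence and ends with Description, and the function name, example rows, and barcodes are hypothetical.

```python
import csv
import io

REQUIRED_FIRST = ["#SampleID", "BarcodeSequence", "LinkerPrimerSequence"]
REQUIRED_LAST = "Description"


def check_mapping(text):
    """Return a list of human-readable problems found in a mapping file."""
    rows = [r for r in csv.reader(io.StringIO(text), delimiter="\t") if r]
    header = rows[0]
    data = [r for r in rows[1:] if not r[0].startswith("#")]  # skip comment lines
    problems = []
    if header[:3] != REQUIRED_FIRST:
        problems.append("header must start with " + ", ".join(REQUIRED_FIRST))
    if header[-1] != REQUIRED_LAST:
        problems.append("last column must be " + REQUIRED_LAST)
    sample_ids = [r[0] for r in data]
    if len(set(sample_ids)) != len(sample_ids):
        problems.append("duplicate sample identifiers")
    for r in data:
        if len(r) != len(header):
            problems.append(f"row {r[0]} has {len(r)} fields, expected {len(header)}")
    return problems


header = ["#SampleID", "BarcodeSequence", "LinkerPrimerSequence", "Treatment", "Description"]
good = "\n".join("\t".join(r) for r in [
    header,
    ["PC.354", "AACCGGTT", "ACGT", "Control", "hypothetical sample 1"],
    ["PC.634", "TTGGCCAA", "ACGT", "Fast", "hypothetical sample 2"],
])
bad = good + "\n" + "\t".join(["PC.634", "ACGTACGT", "ACGT", "Fast"])

print(check_mapping(good))
print(check_mapping(bad))
```

Catching these problems before demultiplexing is much cheaper than discovering them after a long analysis run.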
OTU table (classic format) sample x OTU matrix
OTU table (classic format) sample x OTU matrix: OTU identifiers
OTU table (classic format) sample x OTU matrix: Sample identifiers
OTU table (classic format) sample x OTU matrix: Optional per OTU taxonomic information
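The three parts called out above (OTU identifiers, sample identifiers, optional taxonomy) can be pulled apart with a short parser. This sketch assumes the tab-separated classic layout with a "#OTU ID" header line and an optional trailing "Consensus Lineage" column; the example table and the function name are hypothetical.

```python
def parse_classic_otu_table(text):
    """Return (sample_ids, otu_ids, counts, taxonomy) from a classic OTU table."""
    sample_ids, otu_ids, counts, taxonomy = [], [], [], {}
    has_taxonomy = False
    for line in text.splitlines():
        if not line.strip():
            continue
        fields = line.split("\t")
        if line.startswith("#OTU ID"):
            # Header row: sample identifiers, optionally followed by taxonomy.
            has_taxonomy = fields[-1] == "Consensus Lineage"
            sample_ids = fields[1:-1] if has_taxonomy else fields[1:]
        elif line.startswith("#"):
            continue  # free-text comment line
        else:
            values = fields[1:-1] if has_taxonomy else fields[1:]
            otu_ids.append(fields[0])
            counts.append([float(v) for v in values])
            if has_taxonomy:
                taxonomy[fields[0]] = fields[-1]
    return sample_ids, otu_ids, counts, taxonomy


# Hypothetical two-sample, two-OTU table.
table_text = "\n".join([
    "# QIIME OTU table (hypothetical example)",
    "\t".join(["#OTU ID", "PC.354", "PC.634", "Consensus Lineage"]),
    "\t".join(["0", "0", "1", "Bacteria;Firmicutes"]),
    "\t".join(["1", "3", "0", "Bacteria;Bacteroidetes"]),
])

samples, otus, matrix, lineages = parse_classic_otu_table(table_text)
print(samples, otus, matrix, lineages)
```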
OTU tables are now in biological observation matrix (.biom) format (QIIME 1.4.0-dev and later). Google: biom format. http://biom-format.org See convert_biom.py for translating between classic and biom OTU tables
sample x observation contingency matrix: OTUs, Samples, Observation counts
sample x observation contingency matrix: Functions, Metagenomes, Observation counts
sample x observation contingency matrix. OTUs x Samples: marker gene (e.g., 16S) surveys. Ortholog groups x Genomes: comparative genomics. Taxa x Samples: marker gene (e.g., 16S) surveys. Functions x Metagenomes: metagenomics. Metabolites x Samples: metabolomics. ... Metatranscriptomics
The Biological Observation Matrix (BIOM) Format or: How I Learned To Stop Worrying and Love the Ome-ome. JSON-based format for representing arbitrary sample x observation contingency tables with optional metadata. McDonald et al., GigaScience (2012). http://www.biom-format.org
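As an illustration of what a JSON-based sample x observation table looks like, the sketch below hand-builds a tiny table using the field names of the BIOM 1.0 JSON layout as I understand them, and expands its sparse data into a dense matrix. The sample and OTU identifiers, the metadata, and the `to_dense` helper are hypothetical; in practice the biom-format tooling should be used to read and write these files.

```python
import json

# A hand-written table: 2 observations (rows) x 2 samples (columns).
biom_table = {
    "id": None,
    "format": "Biological Observation Matrix 1.0.0",
    "format_url": "http://biom-format.org",
    "type": "OTU table",
    "generated_by": "hand-written example",
    "date": "2013-08-01T00:00:00",
    "rows": [
        {"id": "OTU1", "metadata": {"taxonomy": ["Bacteria", "Firmicutes"]}},
        {"id": "OTU2", "metadata": {"taxonomy": ["Bacteria", "Bacteroidetes"]}},
    ],
    "columns": [
        {"id": "PC.354", "metadata": None},
        {"id": "PC.634", "metadata": None},
    ],
    "matrix_type": "sparse",
    "matrix_element_type": "int",
    "shape": [2, 2],
    # Sparse entries are [row index, column index, value]; zeros are omitted.
    "data": [[0, 1, 1], [1, 0, 3]],
}


def to_dense(table):
    """Expand the table's data into a list of rows (observations x samples)."""
    n_rows, n_cols = table["shape"]
    if table["matrix_type"] != "sparse":
        return [list(row) for row in table["data"]]
    dense = [[0] * n_cols for _ in range(n_rows)]
    for row, col, value in table["data"]:
        dense[row][col] = value
    return dense


# Round-trip through JSON text, as if the table had been read from a .biom file.
loaded = json.loads(json.dumps(biom_table))
print([r["id"] for r in loaded["rows"]], [c["id"] for c in loaded["columns"]])
print(to_dense(loaded))
```

Because rows and columns carry their own metadata, the same layout serves OTU tables, metagenome function tables, and other sample x observation data.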
Running QIIME. Native installation on Mac (OS X) or Linux. From laptops to 16,000+ core compute cluster. qiime-deploy. Ubuntu Virtual Box. Cloud-based installations. http://ncar.janus.rc.colorado.edu/
Amazon Elastic Compute Cloud (EC2)
Moving Pictures of the Human Microbiome. Two subjects sampled daily, one for six months, one for 18 months. Four body sites: tongue, palm of left hand, palm of right hand, and gut (via fecal swabs). Caporaso JG et al. (2011) Moving pictures of the human microbiome. Genome Biology 12: R50.
Moving Pictures of the Human Microbiome. Investigate the relative temporal variability of body sites. Is there a temporal core microbiome? Technical points: do we observe the same conclusions on 454 and Illumina data?
Moving Pictures of the Human Microbiome: QIIME tutorial. A small subset of the full data set to facilitate short run time: ~0.1% of the full sequence collection. Sequenced across six Illumina GAIIx lanes, with a subset of the samples also sequenced on 454.
Tutorial. Click on the link in the wiki. Find your user name in the notebook. It will look something like: wvtreuren_stamps_2013.ipynb. Click this link. It will open in a new window. Don't do anything else until we complete the next 4 slides.
IPython reference. IPython acts like a hybrid Python/bash environment. We interact with the IPython notebook through its cells.
IPython reference. Commands prefixed by a '!' character are issued to the shell (just like what your terminal runs). Commands not prefixed with '!' are issued to Python, and behave as they normally would in Python. Each 'cell' of the notebook is executable. Shift+Enter (or the play button) executes (or re-executes) the commands in a given cell. You must first click in the cell to give it focus, and then type Shift+Enter or hit the play button.
IPython reference. Each executable cell has a prefix that shows its status (whether it has been run, hasn't been run, or is still running). Hasn't been run. Has been run. Still running.
Tree Building
Experimental Sequences: TTGGAAGATGTCTCAGTTCCAGA TTGGGCCGTATGTCAGTCCCTAAGGAG CTGGGCCGTGTCTCAGTCCCAATCA TTGGAAGATGTCTCAGTTCCAGGGGCTATAA TTGGGCCGTATGTCAGTCCCTACGTAACA
Aligned Sequences: CTG-CGCCGTGTCTCAGT CCTC--AA TTGGAAGATGTCTCAGT----TCCAGA TTGGGCCGTATGTCAGTCCCTAAGGAG CTG-GGCG--TGTCTCAGTCCCAATCA TTGGAAGATGT--CTCAGT-GCTATAA TTGG---ATGTCAGTCCCTACGTAACA
Masked and aligned sequences (figure)
Phylogeny (figure)
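The masking step between alignment and tree building can be illustrated with a simple column filter. This sketch drops alignment columns whose gap fraction exceeds a threshold; it is an assumption about one reasonable masking rule, not a description of QIIME's alignment filtering, and the toy alignment, the threshold, and the function name are hypothetical.

```python
def mask_alignment(aligned, max_gap_fraction=0.5, gap_chars="-."):
    """Remove columns in which more than max_gap_fraction of sequences have a gap."""
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("aligned sequences must all have the same length")
    keep = []
    for col in range(length):
        gaps = sum(seq[col] in gap_chars for seq in aligned)
        if gaps / len(aligned) <= max_gap_fraction:
            keep.append(col)
    # Rebuild every sequence from the retained columns only.
    return ["".join(seq[col] for col in keep) for seq in aligned]


# Hypothetical toy alignment: the fourth column is mostly gaps and is removed.
alignment = ["ACG-T", "AC--T", "ACGAT", "A---T"]
print(mask_alignment(alignment))
```

Removing poorly aligned, gap-rich columns keeps noise out of the distance estimates that the tree-building program relies on.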
In the ancient times of... 2012 We used KiNG for viewing 3D plots in QIIME.
It's 2013! Emperor
Description. 3D visualization tool. Cross-platform. Integrates with QIIME and its workflows. Use case-driven. Easy to use. In active development. http://www.khronos.org/webgl/ http://www.oracle.com/
http://24.media.tumblr.com/tumblr_m6q4dgigkw1qzjxifo1_1280.jpg
Issues, suggestions, feature requests? Contact us: www.github.com/qiime/emperor Or contact the QIIME Forum: http://groups.google.com/group/qiime-forum
Now try the Taxa Summary Plots and OTU Category Significance sections on your own