Introduction to Microbial Community Analysis Tommi Vatanen CS-E5890 - Statistical Genetics and Personalised Medicine
Structure of the lecture Motivation: human microbiome Terminology Data types, analysis Examples from human microbiome project (HMP) and DIABIMMUNE project
Our microbial selves: microbes are in, on & around us
Our microbial selves: microbes are in, on & around us More microbial cells than human cells in the human body (1-2 kg, mostly in the gut). 1000s of species, each containing 1000s of genes (microbial genes outnumber human genes 100:1). Under ideal conditions: aid in digestion, make nutrients (vitamin K), keep bad guys out, train the immune system. Under non-ideal conditions: predispose to, exacerbate, or directly cause deviations from health.
What is metagenomics? Total collection of microorganisms within a community Also microbial community or microbiota Total genomic potential of a microbial community Study of uncultured microorganisms from the environment, which can include humans or other living hosts Total biomolecular repertoire of a microbial community
Sequencing techniques Massive parallel DNA sequencing revolutionized the study of microbial communities No need to isolate bacteria in lab Purify DNA and sequence Golden age of microbial community studies
What to do with your metagenome? Basic science Reservoir of gene and protein functional information Comprehensive snapshot of microbial ecology and evolution Translational science Public health tool monitoring population health and epidemiology Diagnostic or prognostic biomarker for host disease
Examples of metagenomic studies: Global ocean sampling 2003/2004 - ongoing
The NIH Human Microbiome Project (HMP): A comprehensive microbial survey What is a normal human microbiome? 300 healthy human subjects Multiple body sites (15 in males, 18 in females) Multiple visits Clinical metadata www.hmpdacc.org
DIABIMMUNE study on the infant gut microbiome Follow developing infant gut microbiome in Finland, Estonia and Russian Karelia 222 infants, at risk for autoimmune diseases by genotype Monthly stool samples from birth until 3 years Clinical metadata: Diet, antibiotics, mode of birth, vaccinations https://pubs.broadinstitute.org/diabimmune/
Talking about microbes: Phylogenies OTU = operational taxonomic unit
Talking about microbes: Relative abundance Absolute abundance is always masked in data obtained by techniques discussed here Information is measured in relative abundances 30 % of the bacteria are XXX,
Talking about microbes: Abundance vs. prevalence Abundant but not prevalent Prevalent but not abundant Abundant and prevalent
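The difference between relative abundance and prevalence can be illustrated with a small sketch. The count table, the taxon names and the `detection` threshold below are made up for illustration and do not come from HMP or DIABIMMUNE data.

```python
# Hypothetical read counts: one row per taxon, one column per sample.
counts = {
    "taxon_A": [90, 0, 0, 0],    # abundant where present, but rarely present
    "taxon_B": [5, 10, 8, 6],    # present in every sample, always at low levels
    "taxon_C": [5, 90, 92, 94],  # abundant in most samples and present in all
}


def relative_abundance(counts):
    """Divide each count by its sample total so that every sample sums to 1."""
    n_samples = len(next(iter(counts.values())))
    totals = [sum(row[j] for row in counts.values()) for j in range(n_samples)]
    return {
        taxon: [row[j] / totals[j] for j in range(n_samples)]
        for taxon, row in counts.items()
    }


def prevalence(rel_row, detection=0.0):
    """Fraction of samples in which the taxon exceeds the detection threshold."""
    return sum(1 for value in rel_row if value > detection) / len(rel_row)


rel = relative_abundance(counts)
for taxon, row in rel.items():
    print(taxon, "max abundance:", round(max(row), 2), "prevalence:", prevalence(row))
```

Note that only the proportions survive the normalisation: doubling every count in a sample gives the same relative abundances.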
Talking about microbes: diversity Diversity: broadly, a community's number and distribution of organisms Also community composition or structure Alpha-diversity refers to the diversity within a single community (sample) Beta-diversity refers to the dissimilarity between two communities
Talking about microbes: Alpha-diversity (1-sample) scenarios Not diverse Qualitatively diverse Taxonomically diverse Phylogenetically diverse Quantitatively diverse Taxonomically diverse
Talking about microbes: measures for alpha-diversity Richness: number of unique taxa Richness estimates (how many unobserved taxa?): Chao1, $S_{\mathrm{Chao1}} = S_{\mathrm{obs}} + \frac{f_1^2}{2 f_2}$, where $S_{\mathrm{obs}}$ is the number of observed taxa, $f_1$ is the number of singleton taxa (observed only once, one read) and $f_2$ is the number of doubleton taxa Diversity as considered in information theory (entropy): Shannon's diversity index, $H' = -\sum_i p_i \ln p_i$, where $p_i$ is the relative abundance of taxon $i$ Many other measures: Simpson, McIntosh, Berger-Parker; vegan::diversity() in R
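Richness, Chao1 and Shannon diversity can be computed directly from the read counts of one sample. The following is a minimal sketch in plain Python, not a replacement for vegan::diversity(); the example counts are made up, and the fallback used when there are no doubletons is the common bias-corrected form, which is an assumption of this sketch rather than something stated on the slide.

```python
import math


def richness(counts):
    """Number of taxa observed at least once."""
    return sum(1 for c in counts if c > 0)


def chao1(counts):
    """Chao1 richness estimate: S_obs + f1^2 / (2 * f2)."""
    s_obs = richness(counts)
    f1 = sum(1 for c in counts if c == 1)  # singleton taxa
    f2 = sum(1 for c in counts if c == 2)  # doubleton taxa
    if f2 == 0:
        # Assumption: bias-corrected form, avoids division by zero.
        return s_obs + f1 * (f1 - 1) / 2
    return s_obs + f1 * f1 / (2 * f2)


def shannon(counts):
    """Shannon index H' = -sum(p_i * ln(p_i)) over the observed taxa."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


sample = [10, 5, 2, 2, 1, 1, 1, 0]  # hypothetical read counts per taxon
print(richness(sample), chao1(sample), round(shannon(sample), 3))
```

A perfectly even community of k taxa has Shannon index ln(k), and a community with a single taxon has index 0, which gives a quick sanity check on any implementation.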
Alpha-diversity of the gut microbiome increases during the first years of life (figure: microbiome complexity & stability across life stages: birth, 3 yrs, adult, elderly) Kostic, A. D., Xavier, R. J., & Gevers, D. (2014). The microbiome in inflammatory bowel disease: current status and the future ahead. Gastroenterology, 146(6), 1489-99.
Increasing diversity in DIABIMMUNE Increase in diversity during first three years of life New microbes colonize the gut with increasing complexity of diet, environmental exposures, etc.
Talking about microbes: Beta-diversity (2-sample) scenarios (figure panels comparing Sample 1 and Sample 2: qualitatively and taxonomically diverse; quantitatively and taxonomically diverse; quantitatively and phylogenetically diverse)
Talking about microbes: measures for beta-diversity Jaccard index: proportion of shared taxa Bray-Curtis dissimilarity: $BC = 1 - \frac{2C}{S_1 + S_2}$, where $C$ is the sum of the lesser counts for only those species in common between both samples, and $S_1$ and $S_2$ are the total counts in each sample vegan::vegdist in R
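Both measures are short enough to write out by hand. The sketch below assumes two samples given as count vectors over the same ordered list of taxa, with at least one taxon present; the example counts are made up.

```python
def jaccard_index(a, b):
    """Shared taxa divided by all taxa present in either sample."""
    present_a = {i for i, count in enumerate(a) if count > 0}
    present_b = {i for i, count in enumerate(b) if count > 0}
    return len(present_a & present_b) / len(present_a | present_b)


def bray_curtis(a, b):
    """Bray-Curtis dissimilarity 1 - 2C / (S_a + S_b).

    C sums the lesser count of each taxon; taxa missing from either
    sample contribute 0, so only shared taxa matter.
    """
    c = sum(min(x, y) for x, y in zip(a, b))
    return 1 - 2 * c / (sum(a) + sum(b))


sample_1 = [6, 7, 4, 0]   # hypothetical counts for four taxa
sample_2 = [10, 0, 6, 2]
print(jaccard_index(sample_1, sample_2))           # presence/absence only
print(round(bray_curtis(sample_1, sample_2), 3))   # uses the counts
```

The Jaccard index here is a similarity (1 means identical membership), while Bray-Curtis is a dissimilarity (0 means identical counts), so the two run in opposite directions.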
UniFrac beta-diversity accounts for the phylogeny Raw weighted UniFrac metric: $u = \sum_{i=1}^{n} b_i \left| \frac{A_i}{A_T} - \frac{B_i}{B_T} \right|$, where $n$ is the total number of branches in the tree, $b_i$ is the length of branch $i$, $A_i$ and $B_i$ are the numbers of descendants of branch $i$ from communities A and B respectively, and $A_T$ and $B_T$ are the total numbers of sequences from communities A and B respectively Lozupone, C.; Knight, R. (2005). "UniFrac: A New Phylogenetic Method for Comparing Microbial Communities". Applied and Environmental Microbiology 71
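The raw weighted UniFrac sum is easy to evaluate once each branch is stored together with the tips that descend from it. The sketch below uses a hypothetical three-tip tree and made-up sequence counts; the branch-list representation is an assumption made for brevity, and real tools work on full tree objects.

```python
def weighted_unifrac(branches, counts_a, counts_b):
    """Raw weighted UniFrac: sum over branches of b_i * |A_i/A_T - B_i/B_T|.

    branches: list of (branch_length, set_of_descendant_tips)
    counts_a, counts_b: sequence counts per tip for communities A and B
    """
    a_total = sum(counts_a.values())
    b_total = sum(counts_b.values())
    distance = 0.0
    for length, tips in branches:
        a_i = sum(counts_a.get(tip, 0) for tip in tips)
        b_i = sum(counts_b.get(tip, 0) for tip in tips)
        distance += length * abs(a_i / a_total - b_i / b_total)
    return distance


# Hypothetical tree ((t1:1, t2:1):0.5, t3:2) written as a branch list.
branches = [
    (1.0, {"t1"}),
    (1.0, {"t2"}),
    (0.5, {"t1", "t2"}),  # internal branch above t1 and t2
    (2.0, {"t3"}),
]
community_a = {"t1": 5, "t2": 5}
community_b = {"t1": 5, "t3": 5}
print(weighted_unifrac(branches, community_a, community_b))
```

Because t3 sits on a long branch, replacing t2 by t3 costs more than swapping two closely related tips would, which is exactly the phylogenetic information that Jaccard and Bray-Curtis ignore.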
Talking about microbes: ordination Ordination is a constrained projection of high-dimensional data into fewer dimensions Principal component analysis (PCA) chooses the new dimensions so that they capture maximal variance Principal coordinates analysis (PCoA) denotes ordination based on a (dis)similarity matrix Nonmetric multidimensional scaling (NMDS) based on UniFrac beta-diversity is widely used in microbial community analysis Hamady, 2009
t-distributed stochastic neighbor embedding (t-SNE) Modern, distance / similarity matrix based technique for visualizing (high-dimensional) data Find a mapping / visualization that is faithful to the original local neighborhoods in the data Data points similar in the input data tend to be close in the visualization Rtsne::Rtsne in R
What aspects of a human host most influence microbial community composition? Rob Knight ~5,200 microbial communities profiled by 16S sequencing (closer = more similar)
How about the infant gut microbiome? Variation in the infant gut microbiome is dominated by age In DIABIMMUNE, Russians seem to have distinct microbiota compared to Finns and Estonians
Two big questions of microbial community analysis Who is there? What are they doing?
How to obtain data on microbes? Cultivate single strains of bacteria Traditional microbiology + sequencing Sequencing based methods for studying microbial communities Purify all DNA and sequence Amplicon-based methods target specific regions/genes of interest Shotgun sequencing for all DNA material Differences between sequencing methods Short vs. long reads Errors are more problematic than in e.g. human genome analysis
Sequencing as a tool for microbial community analysis (amplicon vs. shotgun) Lyse cells Extract & fragment DNA Sequence short DNA reads Amplicon: 16S (18S, ITS) rRNA gene; conserved across bacteria (allows PCR amplification); some regions are variable (permits genus-level ID) Shotgun: map reads to reference genomes (e.g. reads AGCTAGA, CCGATCG, TTAGCAC, ACTAGCA) or assemble into contigs (e.g. reads AGCTACAGC, ACAGCACGGCAT, GGCATCATC assemble into AGCTACAGCACGGCATCATC) Result: features x samples table of relative abundance
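The contig example on this slide can be reproduced with a toy greedy assembler that repeatedly merges the two sequences with the longest suffix-prefix overlap. This is only a sketch: the `min_len` cut-off is an arbitrary assumption, and real assemblers must also handle sequencing errors, reverse complements and repeats.

```python
def overlap(left, right, min_len=3):
    """Length of the longest suffix of `left` that is a prefix of `right`."""
    for size in range(min(len(left), len(right)), min_len - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def greedy_assemble(reads, min_len=3):
    """Merge the best-overlapping pair until no overlap of min_len remains."""
    contigs = list(reads)
    while len(contigs) > 1:
        best_size, best_i, best_j = 0, None, None
        for i, left in enumerate(contigs):
            for j, right in enumerate(contigs):
                if i != j:
                    size = overlap(left, right, min_len)
                    if size > best_size:
                        best_size, best_i, best_j = size, i, j
        if best_size == 0:
            break  # nothing left to merge
        merged = contigs[best_i] + contigs[best_j][best_size:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)
    return contigs


reads = ["AGCTACAGC", "ACAGCACGGCAT", "GGCATCATC"]  # reads from the slide
print(greedy_assemble(reads))  # ['AGCTACAGCACGGCATCATC']
```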
Typical microbiome community analysis tasks Metagenomic data Stats 16S data
Two big questions of microbial community analysis Who is there? What are they doing?
Metagenomic methods: 16S rRNA gene Structural component of the prokaryotic ribosome Used as a molecular clock to identify phylogeny: large, good scale for mutations Portions are constant, allowing amplification Relatively cheap (figure credits: Woese, 1987; Pace, 1997; George Rice, Montana State University; Ley, 2006; variable regions V2 and V6 marked)
Microbiome composition analysis: phylotypes and binning Binning: nontrivial assignment of reads to phylotypes or OTUs (= clustering / classification) Phylotype or operational taxonomic unit (OTU): organisms clonal to within some tolerance (e.g. 97% sequence identity); approximately species
Microbiome composition analysis: operational taxonomic unit (OTU) binning Open reference: clustering Closed reference: classification (figure: example reads AAA, AAG, AAT, TGA, TTT, TGG and unique sequences >Uniq1 AAA, >Uniq2 TGA, >Uniq3 TTT)
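The two binning strategies can be contrasted on the toy reads from this slide. The sketch below is an assumption-laden simplification: identity is the fraction of matching positions between equal-length reads, clustering is a single greedy pass, and the threshold is 2/3 only because the toy reads are three letters long (in practice a tolerance such as 97% is used, and open-reference pipelines typically combine both steps).

```python
def identity(a, b):
    """Fraction of matching positions between two equal-length sequences."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cluster_reads(reads, threshold):
    """Greedy clustering: a read joins the first centroid it matches,
    otherwise it becomes a new centroid."""
    centroids, assignment = [], {}
    for read in reads:
        for centroid in centroids:
            if identity(read, centroid) >= threshold:
                assignment[read] = centroid
                break
        else:
            centroids.append(read)
            assignment[read] = read
    return centroids, assignment


def classify_reads(reads, references, threshold):
    """Closed-reference step: assign each read to its best-matching
    reference, or to None when nothing reaches the threshold."""
    assignment = {}
    for read in reads:
        best = max(references, key=lambda name: identity(read, references[name]))
        hit = identity(read, references[best]) >= threshold
        assignment[read] = best if hit else None
    return assignment


reads = ["AAA", "AAG", "AAT", "TGA", "TTT", "TGG"]
centroids, clusters = cluster_reads(reads, threshold=2 / 3)
print(centroids)  # ['AAA', 'TGA', 'TTT']

references = {"Uniq1": "AAA", "Uniq2": "TGA", "Uniq3": "TTT"}
print(classify_reads(["TGG", "CCC"], references, threshold=2 / 3))
```

The read CCC shows the practical difference: classification against a fixed reference discards it, whereas clustering would open a new OTU for it.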
QIIME for analysing amplicon sequencing data QIIME (pronounced "chime") is a modular open-source bioinformatics pipeline for analysing microbial amplicon sequencing data Homepage qiime.org contains documentation, tutorials and other resource material Huge collection of scripts for many different analysis tasks
Profiling microbial communities by metagenomic shotgun sequencing (figure: short reads compared against reference genomes)
Indexing microbial pangenomes NCBI isolate genomes: Archaea 300, Bacteria 12,926, Viruses 4,646, Eukaryota 2,177 Bags of protein coding genes: 49.0 million total genes Species pangenomes: 7,677, containing 18.6 million gene clusters Core genes Marker genes RepoPhlAn ChocoPhlAn (http://metaref.org)
MetaPhlAn: Metagenomic Phylogenetic Analysis (figure: short reads compared against reference genomes)
MetaPhlAn data, species x samples
Other software for taxonomic profiling mOTU (metagenomic OTU): http://www.bork.embl.de/software/motu/ MEGAN: http://ab.inf.uni-tuebingen.de/software/megan6/ Kraken: https://ccb.jhu.edu/software/kraken/
Two big questions of microbial community analysis Who is there? What are they doing?
Metagenomic analysis: molecular functions in biological roles (figure: phylum abundance and pathway abundance across subjects for nares, skin, oral (BM, SupP, TD), gut and vaginal sites) http://hmpdacc.org/hmmrc
Metagenomic analysis: molecular functions in biological roles Orthology: grouping genes by conserved sequence features (COG, KO, FIGfam) Structure: grouping genes by similar protein domains (Pfam, TIGRfam, SMART, EC) Biological roles: grouping genes by pathway and process involvement (GO, KEGG, MetaCyc, SEED) Warnecke, 2007; Turnbaugh, 2009; DeLong, 2006
From reads to genes (HUMAnN2) INPUT: Quality controlled metagenome (or metatranscriptome) Rapidly identify species in the community with MetaPhlAn2 Nucleotide search reads vs. pangenomes of identified species Translated search unclassified reads vs. non-redundant protein db Isolate novel reads for external assembly http://huttenhower.sph.harvard.edu/humann2
From reads to genes (HUMAnN2) (figure: HUMAnN core algorithms. Quality-controlled RNA or DNA seq reads go through taxonomic profiling (MetaPhlAn 2), giving a list of abundant organisms; nucleotide-level pangenome mapping (Bowtie 2) against functionally annotated species pangenomes (ChocoPhlAn) gives organism-specific hits; unmapped reads go through organism-agnostic translated search (DIAMOND) against a universal protein reference database (UniRef), giving hits to protein families; pathway collection: MetaCyc) http://huttenhower.sph.harvard.edu/humann2
Body site-specific signature pathways in the human microbiome Note typically large abundance relative to other body sites Note relatively small % of pathway copies unclassified L-rhamnose degradation (RHAMCAT-PWY) emerged as a signature of the human gut microbiome across >900 first-visit HMP1-II metagenomes analyzed
Body site-specific signature pathways in the human microbiome Max area 2% relative abundance (other areas square-root scaled) Signature for area i: Q1(area i) > Q3(area j) for all j ≠ i; very stringent! 50 total signature pathways across 4 major body areas Values plotted = median (Q2) abundance for samples from that area
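The quartile criterion for a signature pathway is simple to state in code. The sketch below applies it to made-up abundances of one pathway in four body areas; the area names, the numbers and the choice of the "inclusive" quantile method are assumptions of this example, not details of the HMP analysis.

```python
import statistics


def quartiles(values):
    """Return (Q1, Q2, Q3) of a list with at least two values."""
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q1, q2, q3


def signature_areas(abundance_by_area):
    """Areas i whose Q1 exceeds the Q3 of every other area j."""
    stats = {area: quartiles(values) for area, values in abundance_by_area.items()}
    return [
        area
        for area, (q1, _, _) in stats.items()
        if all(q1 > stats[other][2] for other in stats if other != area)
    ]


# Hypothetical relative abundances (%) of one pathway, five samples per area.
pathway = {
    "gut": [0.8, 1.0, 1.2, 1.5, 2.0],
    "oral": [0.0, 0.1, 0.1, 0.2, 0.3],
    "skin": [0.0, 0.0, 0.1, 0.1, 0.4],
    "vaginal": [0.0, 0.0, 0.0, 0.1, 0.2],
}
print(signature_areas(pathway))  # ['gut']
```

The test is stringent because the lower quartile of one area has to clear the upper quartile of all others, so a pathway that is merely higher on average in one area does not qualify.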
Which functions of the microbiome are disrupted in IBD? Over six times as many microbial metabolic processes disrupted in IBD as microbes. If there's a transit strike, everyone driving a bus in Helsinki is disrupted, not everyone named Virtanen or Doe Phylogenetic distribution of function is consistent but diffuse During IBD, microbes... Stop: creating most amino acids, degrading complex carbs, producing short-chain fatty acids Start: taking up more host products, dodging the immune system, adhering to and invading host cells
Confounding effects in real world data Biology is complicated, everything affects everything Scientists cannot control everything; in observational cohorts they do not even try to Observed associations may be explained by confounding factors
Confounding effects in psychology Classical example: drowning incidents and ice cream sales are highly positively correlated Explanations Possibility #1: People drowning causes other people to purchase ice cream Possibility #2: Purchasing ice cream causes people to drown Possibility #3: There is a third variable (confounding variable) that causes the increase in both ice cream sales and drowning incidents The weather confounds the relationship between ice cream sales and drowning incidents Confounding variables are common in microbiome studies Lots of environmental factors affect the gut microbiome
Solution #1: post hoc checking of results Consumption of vegetables is correlated with species X Check if any other collected metadata (information about the study subjects) is correlated or associated with the consumption of vegetables No: you did not see any confounding factors, but there still might be some Yes: can you stratify your analyses to further confirm the finding? E.g.: females consume more vegetables and have more species X. Does the correlation hold within females or males only?
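Stratification amounts to recomputing the correlation inside each level of the suspected confounder. In the sketch below the vegetable intake, species X abundance and sex labels are made-up numbers, chosen so that the pooled correlation is strong while it vanishes within each sex.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def stratified_correlation(x, y, strata):
    """Correlation of x and y computed separately within each stratum."""
    result = {}
    for level in sorted(set(strata)):
        idx = [i for i, s in enumerate(strata) if s == level]
        result[level] = pearson([x[i] for i in idx], [y[i] for i in idx])
    return result


# Hypothetical data: females eat more vegetables and carry more species X.
vegetables = [4, 5, 6, 1, 2, 3]
species_x = [9, 8, 9, 3, 2, 3]
sex = ["F", "F", "F", "M", "M", "M"]

print(round(pearson(vegetables, species_x), 2))  # pooled: strongly positive
print(stratified_correlation(vegetables, species_x, sex))  # per sex: about 0
```

In this constructed case the pooled association is carried entirely by the difference between the sexes, which is the pattern the stratified check is meant to expose.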
Solution #2: Design and conduct a controlled experiment Consumption of vegetables is correlated with species X Design an experiment where subjects are randomly assigned to consume 1) a lot of, or 2) no vegetables Control known confounders E.g. both groups contain the same number of males and females
Solution #3: Statistical modeling Test if the correlation / association holds after correcting for the confounding effects statistically Linear models are easy to understand and computationally cheap
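Correcting for a confounder with a linear model can be sketched without any libraries: regress both variables on the confounder and relate the residuals. The resulting slope equals the coefficient the exposure would get in a multiple regression that also includes the confounder. The ice cream, drowning and temperature numbers below are made up so that temperature drives both variables.

```python
def simple_fit(x, y):
    """Least-squares (intercept, slope) for the model y ~ 1 + x."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    slope = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / sum(
        (a - mean_x) ** 2 for a in x
    )
    return mean_y - slope * mean_x, slope


def residuals(x, y):
    """What is left of y after removing its linear dependence on x."""
    intercept, slope = simple_fit(x, y)
    return [b - (intercept + slope * a) for a, b in zip(x, y)]


def adjusted_slope(exposure, outcome, confounder):
    """Slope of outcome on exposure after regressing the confounder
    out of both variables."""
    exposure_res = residuals(confounder, exposure)
    outcome_res = residuals(confounder, outcome)
    return simple_fit(exposure_res, outcome_res)[1]


# Hypothetical data: temperature drives both ice cream sales and drownings.
temperature = [10, 15, 20, 25, 30, 35]
ice_cream = [21, 29, 40, 50, 59, 71]
drownings = [6, 8.5, 8, 10.5, 16, 18.5]

print(round(simple_fit(ice_cream, drownings)[1], 3))  # crude slope, positive
print(round(adjusted_slope(ice_cream, drownings, temperature), 3))  # about 0
```

The crude slope suggests an effect of ice cream sales on drownings, while the adjusted slope shows that nothing remains once temperature is accounted for. With several confounders the same idea is usually expressed as one multiple regression.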
Lipid A biosynthesis in DIABIMMUNE infants
Typical microbiome community analysis tasks Metagenomic data Stats 16S data