TECHNIQUES FOR STUDYING METAGENOME DATASETS METAGENOMES TO SYSTEMS.

1 TECHNIQUES FOR STUDYING METAGENOME DATASETS METAGENOMES TO SYSTEMS. Ian Jeffery

2 What is metagenomics? Metagenomics is the study of genetic material recovered directly from environmental samples. Modern genomic techniques allow the study of communities of microbial organisms, with no need for isolation and no need for lab cultivation. This allows the observation of alterations in the metagenome in different states (disease versus healthy).

3 Research and analytical challenges: Investigate the system-level functionality and taxonomic variability of the metagenome. Study the alterations in the metagenome in disease. Assembly of novel genomes (new taxa, new genes). Integration of metagenomic datasets with each other and with other data sources. Analysis and integration of next-generation sequencing datasets through the application of de novo bioinformatic methodologies.

4 What technology to use? Bacterial populations can be studied with 16S ribosomal RNA sequencing or whole-genome shotgun sequencing, both by next-generation sequencing. 16S ribosomal RNA: OTUs, taxonomy, abundance and diversity. Whole-genome shotgun: assembly, gene abundance, taxonomy abundance, functional analysis.

5 Example pipeline for metagenomic analysis
Amplicon sequencing: Pre-processing of S sequencing data; quality filtering; chimera removal; denoising; clustering at 97% sequence identity; classification of taxa; generate UniFrac distances; visualisation and analysis.
Shotgun sequencing (ad hoc pipeline):
Quality filter: 1. Remove human contamination. 2. Remove low-quality bases based on Phred scores. 3. Remove reads <60 bp. 4. Remove duplicate reads.
Assemble: assemble reads into contigs, trying 3 approaches (different k-mer).
Predict genes: predict genes using MetaGeneMark.
Predict pathways: cluster genes (uclust) and mblast representative genes against the eggNOG and KEGG databases.
Determine copy numbers: align reads to genes using Bowtie to determine per-sample copy numbers of genes.
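
The shotgun quality-filter steps 2 to 4 above can be sketched as follows. This is a minimal illustration, not the pipeline used in the talk: the function names, the Phred+33 encoding, the 3'-end trimming strategy and the quality threshold of 20 are all assumptions. Only the 60 bp minimum length comes from the slide. Step 1 (human contamination removal) needs alignment against a reference and is not shown.

```python
def phred33(qual_string):
    """Decode a FASTQ quality string, assuming Phred+33 encoding."""
    return [ord(c) - 33 for c in qual_string]


def trim_low_quality(seq, quals, min_q=20):
    """Trim bases from the 3' end while their quality is below min_q (assumed cutoff)."""
    end = len(seq)
    while end > 0 and quals[end - 1] < min_q:
        end -= 1
    return seq[:end]


def filter_reads(records, min_q=20, min_len=60):
    """records: iterable of (name, sequence, quality_string) tuples.

    Returns (name, trimmed_sequence) for reads that survive trimming,
    the minimum-length cutoff, and exact-duplicate removal.
    """
    seen = set()
    kept = []
    for name, seq, qual in records:
        trimmed = trim_low_quality(seq, phred33(qual), min_q)
        if len(trimmed) < min_len:   # step 3: drop reads shorter than 60 bp
            continue
        if trimmed in seen:          # step 4: drop exact duplicate sequences
            continue
        seen.add(trimmed)
        kept.append((name, trimmed))
    return kept
```

Calling `filter_reads` on a list of parsed FASTQ records returns the surviving reads in input order. Real pipelines usually rely on dedicated trimming tools rather than hand-written code like this.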

6 What technology to use? Bacterial populations can be studied with 16S ribosomal RNA sequencing or whole-genome shotgun sequencing, both by next-generation sequencing. Amplicon sequencing is still the most widely used method. 16S ribosomal RNA: OTUs, taxonomy, abundance and diversity. Whole-genome shotgun: assembly, gene abundance, taxonomy abundance, functional analysis.

7 Amplicon or shotgun metagenomics?
Amplicon cons: 1. Different primers will have different detection efficiencies. 2. Sequencing errors may artificially inflate the diversity of the sample. 3. Function can only be predicted indirectly.
Whole shotgun metagenomic cons: 1. Large datasets (storage). 2. Large datasets (computation). 3. Large datasets (analysis and statistical significance). 4. Cost.

8 PICRUSt Predicts metagenome functional content from amplicon datasets
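
The core arithmetic behind this kind of prediction can be sketched as below. This is a simplified illustration, not PICRUSt's actual code or API: the function and variable names are hypothetical, and the per-OTU 16S copy numbers and gene contents are assumed to be given already (inferring them from reference genomes is the hard part and is not shown).

```python
def predict_metagenome(otu_counts, copy_number_16s, gene_content):
    """Estimate gene-family abundances from a 16S OTU table.

    otu_counts:      {otu_id: 16S read count in one sample}
    copy_number_16s: {otu_id: assumed number of 16S copies per genome}
    gene_content:    {otu_id: {gene_family: assumed copies per genome}}
    """
    predicted = {}
    for otu, count in otu_counts.items():
        # Dividing by 16S copy number converts read counts into
        # approximate organism abundance.
        organisms = count / copy_number_16s[otu]
        for gene, copies in gene_content[otu].items():
            predicted[gene] = predicted.get(gene, 0.0) + organisms * copies
    return predicted
```

For example, an OTU with 10 reads and 5 copies of the 16S gene contributes 2 genome equivalents, so a gene family present twice in that genome adds 4 to the predicted abundance.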

9 SHOTGUN SEQUENCING, ASSEMBLY AND ANNOTATION. Binning contigs into individual genomes. Tools shown: AmphoraNet, METAGENassist, MetaPhlAn, Real time metagenomics, Comet, ESOM, MaxBin, MetaBAT.

10 SHOTGUN METAGENOMICS USING MG-RAST. 29 subjects representative of C/R/LS. Shotgun metagenomics (27/29): total extracted bacterial DNA sequenced. In total 126 Gb of DNA sequenced at BGI; 51 million 2x91 bp Illumina reads/sample.
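
As a rough sanity check on these figures, the total yield can be recomputed from the per-sample numbers. The calculation assumes that the 51 million figure counts individual reads rather than read pairs, and that all 27 sequenced samples have about that depth. Neither assumption is stated on the slide.

```python
samples = 27                 # subjects with shotgun data (27/29)
reads_per_sample = 51e6      # assumed: individual reads, not pairs
read_length_bp = 91

total_gb = samples * reads_per_sample * read_length_bp / 1e9
print(f"{total_gb:.1f} Gb")  # roughly 125 Gb under these assumptions
```

Under these assumptions the result is close to the quoted 126 Gb total.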

11 METAGENOME ASSEMBLY. Short next-generation sequencing reads. Pros: high coverage and low cost/bp. Cons: assembly needed for reliable annotation. Single-genome assembly methods (e.g. SOAPdenovo, MIRA, ABySS, Velvet) assume uniform sequence coverage: low coverage => treated as sequencing errors => gap; high coverage => treated as repeats => gap.

12 Single-genome assembler, e.g. Velvet: low-coverage sequence is removed as errors; mid-coverage sequence gives incomplete short contigs; high-coverage sequence gives repeat-labelled incomplete short contigs. Metagenome assembler, e.g. MetaVelvet / IDBA-UD: low coverage: species A; mid coverage: species B; high coverage: species C. Result: assembled species A, assembled species B, assembled species C.
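
The problem with a single coverage assumption can be illustrated with k-mer counts. In the toy sketch below, a fixed minimum-count cutoff of the kind used to discard sequencing errors also discards every k-mer from a low-abundance species. The reads, the value of k and the cutoff are made up for illustration, and the code does not reproduce what Velvet, MetaVelvet or IDBA-UD actually do internally.

```python
from collections import Counter


def kmer_counts(reads, k):
    """Count every k-mer occurring in a collection of reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def solid_kmers(counts, min_count):
    """Keep k-mers seen at least min_count times (a simple error filter)."""
    return {kmer for kmer, c in counts.items() if c >= min_count}


# Toy community: species A is covered once, species C ten times.
reads = ["ACGTACGGA"] + ["TTGCATTCA"] * 10
counts = kmer_counts(reads, k=5)
solid = solid_kmers(counts, min_count=2)
# Every species A k-mer falls below the cutoff and would be lost,
# while all species C k-mers are retained.
```

Metagenome assemblers avoid this by not applying one global coverage expectation to the whole dataset.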

13 ASSEMBLY RESULTS FOR 27 SAMPLES. Avg # of contigs = 24,000 ± 3,500. *N50: 50% of the (meta)genome is assembled into contigs larger than this value.
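
The N50 statistic defined above can be computed from a list of contig lengths as in this minimal sketch. It uses the common convention that N50 is the length of the contig at which the running total, taken from longest to shortest, first reaches half of the total assembly length. The function name is hypothetical.

```python
def n50(contig_lengths):
    """Return the N50 of an assembly given its contig lengths."""
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length
    raise ValueError("contig_lengths must not be empty")
```

For contigs of lengths 2 to 10 (total 54), the three longest contigs sum to 27, so `n50([2, 3, 4, 5, 6, 7, 8, 9, 10])` returns 8.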

14 MetaGene predicts genes in two stages. 1. Identify all possible ORFs: scored by their base compositions and lengths; scores of neighboring ORFs. 2. Scoring of the putative genes: di-codon frequencies, ORF length distributions, orientations and distances of neighboring ORFs. Result: 2.51 million predicted genes (MetaGene).
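
The first stage, enumerating candidate ORFs, can be sketched with a simple six-frame scan. This is only an illustration of ORF enumeration under stated assumptions: it requires an ATG start and an in-frame stop codon, ignores partial ORFs at contig ends, and does not reproduce MetaGene's scoring. The function names and the minimum length are hypothetical.

```python
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def orfs_in_strand(seq, min_len):
    """Return (start, end) pairs of ATG..stop ORFs in the three frames of seq."""
    found = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i                      # first start codon opens an ORF
            elif start is not None and codon in STOPS:
                if i + 3 - start >= min_len:   # length includes the stop codon
                    found.append((start, i + 3))
                start = None
    return found


def find_orfs(seq, min_len=60):
    """Scan both strands; '-' coordinates refer to the reverse complement."""
    seq = seq.upper()
    result = [("+", s, e) for s, e in orfs_in_strand(seq, min_len)]
    result += [("-", s, e)
               for s, e in orfs_in_strand(reverse_complement(seq), min_len)]
    return result
```

A gene finder would then score each candidate, for example by codon usage and length, to decide which ORFs are likely to be real genes.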

15 GENE NUMBER AND DIVERSITY

16 Principal Component Analysis of functional and phylogenetic annotations. Principal Component Analysis of normalized Bray-Curtis distances from SEED Subsystem mapping, using a maximum e-value of 1e10, a minimum identity of 10 %, and a minimum alignment length of 20.
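
The Bray-Curtis dissimilarity used for this kind of ordination can be computed between two abundance profiles as in the sketch below. The function name and the example profiles are illustrative. How the abundances were normalized before computing distances is not described on the slide and is not shown here.

```python
def bray_curtis(u, v):
    """Bray-Curtis dissimilarity between two non-negative abundance vectors.

    Defined as sum(|u_i - v_i|) / sum(u_i + v_i); 0 means identical
    profiles and 1 means no shared features.
    """
    if len(u) != len(v):
        raise ValueError("profiles must have the same length")
    numerator = sum(abs(a - b) for a, b in zip(u, v))
    denominator = sum(a + b for a, b in zip(u, v))
    if denominator == 0:
        raise ValueError("both profiles are empty")
    return numerator / denominator
```

Computing this for every pair of samples gives the distance matrix that an ordination method then projects into a few dimensions.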

17 How does the community metagenome differ from that of long-stay subjects? Starch and sucrose metabolism; purine metabolism; pyrimidine metabolism. How do community genes differ from long-stay? Cysteine and methionine metabolism; glycine, serine, threonine; alanine, aspartate, glutamate metabolism.

18 Open-source system for annotation and comparative analysis of metagenomes. 100% free to upload, store and analyse. Browse and compare public and private (password-protected) metagenomes.

19 VISUALISATIONS No Standard Visualisations

20 Carriage of microbial taxa varies while metabolic pathways remain stable within a healthy population.

21 Homology-based classification of patient-associated Prevotella. Four NORA subjects with a high abundance of Prevotella OTU4 were selected for shotgun sequencing and metagenome assembly. Jose U Scher et al. eLife 2013;2:e01202

22 Cedar Ren "Science artist Martin Krzywinski has compiled a series of eyecatching circular diagrams based on relationships between the digits in pi"

23 Taxonomic fingerprint of ageing

24 Age-related trajectory of gut microbiome functions

25 Conclusions: Metagenomic data is complex, and the computational requirements associated with its analysis can be large. Web-based tools are available that have an easy-to-use interface. Which tools to apply will depend on the type of output you want, so try to decide this early in the analysis. If the dataset is large, where will the analysis take place? Functional prediction from 16S and small shotgun datasets can be a powerful combination for functional as well as taxonomic analysis.