I AM NOT A METAGENOMIC EXPERT. I am merely the MESSENGER. Blaise T.F. Alako, PhD EBI Ambassador


Hubert Denise Alex Mitchell Peter Sterk Sarah Hunter

http://www.ebi.ac.uk/metagenomics Blaise T.F. Alako EBI Ambassador blaise@ebi.ac.uk

Where is the true cost of NGS?
[Figure: cost breakdown charts; surviving labels: 70 % (~80 bp/$), 28 % (~2m bp/$), 14.5 %, 30 %, 4.5 %, 36.5 %, 55 %]
Sboner et al. Genome Biology (2011) 12:125

EBI Metagenomics pipeline
- Philosophy
- Overview data analysis
- Data analysis using selected EBI and external software tools
- QC steps + tutorial
- Overview of functional analysis
- Result outputs
- Other public pipelines

Philosophy behind EBI Metagenomics pipeline
Helping metagenomics researchers make sense of their data. From chaos to structure:
- archiving of data with metadata
- performing stringent QC filtering prior to analysis (quality in, quality out)
- performing robust taxonomy and functional analysis: model-based rather than similarity-based approaches, with assignment done on reads rather than on assemblies
- intuitive navigation through the website
- constant drive to improvement: benchmarking and tool testing

EBI Metagenomics currently does not perform assembly
Why?
- absence of reference genomes
- short reads make chimaeras inevitable (e.g. re-analysis of Hess et al., Science (2011) 331:463)
What are the consequences?
- cannot link taxonomy information to functional annotations
- cannot currently perform viral taxonomy analysis

Metagenomics data analysis: diversity analysis, quality control, functional analysis
Image credits: (1) Christina Toft & Siv G. E. Andersson; (2) Dalebroux Z D et al. Microbiol. Mol. Biol. Rev. 2010;74:171-199

Overview of EBI Metagenomics Pipeline
- raw reads -> Quality control -> processed reads (reads that fail QC are discarded)
- processed reads -> rRNASelector -> reads with rRNA, reads without rRNA
- reads with rRNA, amplicon-based data -> Qiime -> Taxonomic analysis
- reads without rRNA -> FragGeneScan -> predicted CDS (pCDS) -> InterProScan -> Function assignment, or pCDS of unknown function


EBI Metagenomics: QC rationale
Why? Garbage in, garbage out.
- Base call errors: each base call has an associated quality score, and error profiles are platform-dependent
- Read quality decreases along the length of the read
- NGS generates duplicate reads (false and real). Reducing duplication reduces analysis time and prevents analysis bias.
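
The quality score attached to each base call is conventionally a Phred score Q, related to the error probability P by P = 10^(-Q/10). The sketch below illustrates that relationship only; it assumes FASTQ quality strings encoded with an ASCII offset of 33, which the slides do not state, and the function names are illustrative rather than part of the EBI pipeline.

```python
from typing import List


def phred_to_error_prob(q: int) -> float:
    """Convert a Phred quality score Q to a base-call error probability."""
    return 10 ** (-q / 10)


def decode_qualities(quality_string: str, offset: int = 33) -> List[int]:
    """Decode a FASTQ quality string into Phred scores.

    The offset of 33 is an assumption (Sanger-style encoding).
    """
    return [ord(ch) - offset for ch in quality_string]


def mean_error_prob(quality_string: str, offset: int = 33) -> float:
    """Average per-base error probability across one read."""
    scores = decode_qualities(quality_string, offset)
    return sum(phred_to_error_prob(q) for q in scores) / len(scores)


if __name__ == "__main__":
    # Under the assumed offset, 'I' encodes Q40 and '5' encodes Q20.
    print(decode_qualities("II55"))
    print(mean_error_prob("II55"))
```

A read whose trailing bases carry low scores contributes most of its expected errors at the 3' end, which is why end trimming comes first in the QC steps.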

EBI Metagenomics: QC step by step
- Clipping: low quality ends trimmed and adapter/barcode sequences removed using the Biopython SeqIO package
- Quality filtering: sequences with > 10% undetermined nucleotides removed
- Read length filtering: depending on the platform, short sequences are removed
- Duplicate sequences removal: clustered on 99% identity (UCLUST v 1.1.579) and a representative sequence chosen
- Repeat masking: RepeatMasker (open-3.2.2); reads with 50% or more nucleotides masked are removed
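
Three of the QC steps above can be sketched in a few lines of plain Python. This is a simplified illustration, not the pipeline's code: the 10% threshold for undetermined nucleotides comes from the slide, the minimum length of 100 bp is a hypothetical value (the real cutoff is platform-dependent), and exact-duplicate removal stands in for the 99% identity clustering done with UCLUST. Clipping and repeat masking are left out.

```python
from typing import Dict

MAX_N_FRACTION = 0.10  # from the slide: > 10% undetermined nucleotides removed
MIN_LENGTH = 100       # hypothetical; the real cutoff depends on the platform


def n_fraction(seq: str) -> float:
    """Fraction of undetermined nucleotides (N) in a read."""
    return seq.upper().count("N") / len(seq) if seq else 1.0


def passes_filters(seq: str, min_length: int = MIN_LENGTH) -> bool:
    """Apply the quality (N content) and read length filters."""
    return len(seq) >= min_length and n_fraction(seq) <= MAX_N_FRACTION


def remove_exact_duplicates(reads: Dict[str, str]) -> Dict[str, str]:
    """Keep the first read seen for each distinct sequence.

    Simplification: the pipeline clusters at 99% identity instead.
    """
    seen = set()
    kept = {}
    for read_id, seq in reads.items():
        key = seq.upper()
        if key not in seen:
            seen.add(key)
            kept[read_id] = seq
    return kept


def run_qc(reads: Dict[str, str], min_length: int = MIN_LENGTH) -> Dict[str, str]:
    """Filter reads, then collapse exact duplicates."""
    filtered = {rid: s for rid, s in reads.items() if passes_filters(s, min_length)}
    return remove_exact_duplicates(filtered)


if __name__ == "__main__":
    reads = {
        "r1": "ACGT" * 30,             # passes
        "r2": "ACGT" * 30,             # exact duplicate of r1
        "r3": "N" * 20 + "ACGT" * 25,  # too many undetermined nucleotides
        "r4": "ACGT" * 5,              # too short
    }
    print(sorted(run_qc(reads)))
```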

EBI Metagenomics: QC consequences
[Figure panels: Roche 454, Ion Torrent, Illumina]

EBI Metagenomics: overview of functional analysis
reads without rRNA -> FragGeneScan -> predicted CDS (pCDS) -> InterProScan -> Function assignment, or pCDS of unknown function

EBI Metagenomics: identification of coding sequences
Prediction of coding sequences is a challenge because of:
- read length
- sequencing errors, which cause frame-shifts
Two main types of approaches:
- homology-based methods: identify only known coding sequences
- feature-based approaches: predict the probability that ORFs are coding
EBI Metagenomics uses FragGeneScan:
- hidden Markov models to correct frame-shifts using codon usage
- probabilistic identification of start and stop codons
- 60 bp minimum ORF
Rho et al. (2010) NAR 38(20)
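
To make the 60 bp minimum concrete, here is a deliberately naive scan for stop-free stretches in the three forward reading frames of a read. It is not FragGeneScan's method: there is no hidden Markov model, no codon usage, no frame-shift correction and no reverse strand. The function name and coordinate convention are illustrative assumptions.

```python
from typing import List, Tuple

STOP_CODONS = {"TAA", "TAG", "TGA"}
MIN_ORF_BP = 60  # minimum ORF length quoted on the slide


def open_stretches(seq: str, min_bp: int = MIN_ORF_BP) -> List[Tuple[int, int, int]]:
    """Return (frame, start, end) for stop-free stretches of at least min_bp.

    Coordinates are 0-based and end-exclusive; only the forward strand
    is scanned. A stretch may run to the end of the read, since short
    reads often contain only part of a gene.
    """
    seq = seq.upper()
    found = []
    for frame in range(3):
        start = frame
        pos = frame
        while pos + 3 <= len(seq):
            if seq[pos:pos + 3] in STOP_CODONS:
                if pos - start >= min_bp:
                    found.append((frame, start, pos))
                start = pos + 3  # next stretch begins after the stop codon
            pos += 3
        # Stretch that reaches the end of the read without hitting a stop.
        if pos - start >= min_bp:
            found.append((frame, start, pos))
    return found


if __name__ == "__main__":
    read = "ATG" + "GCT" * 20 + "TAA" + "GCT" * 5
    for frame, start, end in open_stretches(read):
        print(frame, start, end, end - start)
```

A single inserted or deleted base shifts the frame and breaks such a scan, which is the problem the HMM-based approach is designed to handle.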

EBI Metagenomics: annotation of coding sequences
Most available pipelines use homology-based methods (such as BLAST):
- compare a query sequence with a database of sequences
- identify database sequences that resemble the query sequence with a similarity score above a certain threshold
However, sequences may score poorly because:
- proteins may share homology only in limited domains
- proteins from different species can differ in length
The EBI Metagenomics pipeline does not use pairwise similarity-based methods to associate functions with predicted protein sequences; instead it uses InterProScan to mine the InterPro database.

EBI Metagenomics: Advantage of InterPro
InterPro database (HMM and profile based functional analysis): based on the presence of signatures (models) from several databases
Specificity: mapping is manually curated
  BLAST vs. UniRef100 hit        InterProScan hit
  C7VBM8, Predicted protein      5-formyltetrahydrofolate cyclo-ligase-like (IPR024185)
  C7VC62, Predicted protein      Transcription regulator HTH, LysR (IPR000847)
Speed: test set of 40692 predicted protein sequences
  BLAST vs. UniRef100            21.5 s/cds
  InterProScan (5 databases)     3 s/cds
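
The per-sequence timings translate into total run time as follows. The figures are taken from the slide; the only assumption is that sequences are processed one after another on a single worker, which the slide does not specify.

```python
N_SEQUENCES = 40692           # test set size from the slide
BLAST_S_PER_CDS = 21.5        # BLAST vs UniRef100, seconds per CDS
INTERPROSCAN_S_PER_CDS = 3.0  # InterProScan (5 databases), seconds per CDS


def total_hours(n_sequences: int, seconds_per_cds: float) -> float:
    """Total serial run time in hours (single worker assumed)."""
    return n_sequences * seconds_per_cds / 3600.0


blast_hours = total_hours(N_SEQUENCES, BLAST_S_PER_CDS)
interproscan_hours = total_hours(N_SEQUENCES, INTERPROSCAN_S_PER_CDS)
speedup = BLAST_S_PER_CDS / INTERPROSCAN_S_PER_CDS

if __name__ == "__main__":
    print(f"BLAST:        {blast_hours:.1f} h")         # 243.0 h
    print(f"InterProScan: {interproscan_hours:.1f} h")  # 33.9 h
    print(f"Speed-up:     {speedup:.1f}x")              # 7.2x
```

Under that assumption the test set takes roughly ten days with BLAST and under a day and a half with InterProScan.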

EBI Metagenomics: overview of taxonomy analysis
processed reads -> rRNASelector -> reads with rRNA -> Qiime -> Taxonomic analysis (amplicon-based data also go to Qiime)

EBI Metagenomics: identification of suitable sequences
Taxonomy analysis is generally based on identification and classification of rRNA sequences:
- Prokaryotes (archaebacteria and eubacteria): 5S, 16S and 23S
- Eukaryotes: 5S, 5.8S, 18S and 28S
- Viruses: there is no equivalent, so analysis depends on DNA polymerase or part of the 5'-UTR (internal ribosomal entry site [IRES]) sequences
EBI Metagenomics currently only provides taxonomy analysis for Prokaryotes.
rRNA sequences are identified using rRNASelector:
- hidden Markov models to identify rRNA sequences
- 60 bp minimum overlap with well-curated HMM model
- E-value < 10^-5
Lee et al. (2011) J Microbiol. 49(4)
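
The two acceptance criteria (at least 60 bp overlap with the model and an E-value below 10^-5) amount to a simple filter over HMM hits. The sketch below assumes a hypothetical hit record, `HmmHit`, with illustrative field names; it does not reproduce rRNASelector's actual output format or API.

```python
from dataclasses import dataclass
from typing import Iterable, List, Set, Tuple

MIN_OVERLAP_BP = 60  # from the slide
MAX_EVALUE = 1e-5    # from the slide: E-value < 10^-5


@dataclass
class HmmHit:
    """Hypothetical record for one HMM match on a read."""
    read_id: str
    model: str    # e.g. "16S"
    start: int    # 0-based start of the match on the read
    end: int      # end-exclusive
    evalue: float


def is_rrna_hit(hit: HmmHit) -> bool:
    """True if the hit satisfies both thresholds."""
    return (hit.end - hit.start) >= MIN_OVERLAP_BP and hit.evalue < MAX_EVALUE


def split_reads(read_ids: Iterable[str], hits: Iterable[HmmHit]) -> Tuple[List[str], List[str]]:
    """Partition reads into (with rRNA, without rRNA)."""
    rrna_ids: Set[str] = {h.read_id for h in hits if is_rrna_hit(h)}
    with_rrna = [r for r in read_ids if r in rrna_ids]
    without_rrna = [r for r in read_ids if r not in rrna_ids]
    return with_rrna, without_rrna


if __name__ == "__main__":
    hits = [
        HmmHit("r1", "16S", 0, 120, 1e-20),  # accepted
        HmmHit("r2", "16S", 10, 50, 1e-20),  # overlap too short
        HmmHit("r3", "23S", 0, 100, 1e-3),   # E-value too high
    ]
    print(split_reads(["r1", "r2", "r3", "r4"], hits))
```

In the pipeline overview, the first group would go on to Qiime and the second to FragGeneScan.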

EBI Metagenomics: identification of suitable sequences
Once identified, rRNA sequences are clustered and classified using Qiime.
QIIME stands for Quantitative Insights Into Microbial Ecology. It is an open source software package for comparison and analysis of microbial communities.
The main steps are:
- clustering sequences into Operational Taxonomic Units (OTUs) using uclust
- picking a representative sequence set (one sequence from each OTU)
- aligning the representative sequence set using PyNAST
- assigning taxonomy to the representative sequence set
- generating output files: filtering the alignment prior to tree building, building a phylogenetic tree, creating the OTU table
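
The first two steps and the OTU table can be illustrated with a toy greedy centroid clustering. This is a sketch of the general idea only, not uclust or QIIME code: identity is computed position by position without alignment, and the 97% threshold is an assumed conventional value that the slides do not specify.

```python
from typing import Dict, List

OTU_THRESHOLD = 0.97  # assumed; the slides do not give the OTU identity cutoff


def identity(a: str, b: str) -> float:
    """Position-wise identity over the shorter sequence (no alignment)."""
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / n


def greedy_cluster(seqs: Dict[str, str], threshold: float = OTU_THRESHOLD) -> Dict[str, List[str]]:
    """Assign each sequence to the first centroid it matches, else start a new OTU.

    Returns a mapping from representative (centroid) id to member ids.
    """
    otus: Dict[str, List[str]] = {}
    for seq_id, seq in seqs.items():
        for centroid_id in otus:
            if identity(seq, seqs[centroid_id]) >= threshold:
                otus[centroid_id].append(seq_id)
                break
        else:
            otus[seq_id] = [seq_id]  # no centroid matched: new OTU
    return otus


def otu_table(otus: Dict[str, List[str]]) -> Dict[str, int]:
    """Count the reads assigned to each OTU."""
    return {centroid: len(members) for centroid, members in otus.items()}


if __name__ == "__main__":
    base = "ACGT" * 25
    seqs = {
        "a": base,
        "b": "T" + base[1:],  # one mismatch in 100 positions
        "c": "TTTT" * 25,     # unrelated sequence
    }
    print(otu_table(greedy_cluster(seqs)))
```

Because assignment is greedy, the result depends on input order; each OTU's centroid doubles as its representative sequence for the alignment and taxonomy steps.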

EBI Metagenomics pipeline in a nutshell
A powerful and sophisticated alternative to BLAST-based functional metagenomic analysis.
QC:
- trim adaptor sequences and low quality sequence ends
- remove duplicates and short sequences
- remove low complexity sequences
Diversity analysis:
- identify prokaryotic rRNA sequences (5S, 16S and 23S)
- cluster rRNA-containing reads
- assign taxonomy classification using Qiime
Functional analysis:
- predict ORFs
- translate ORFs into peptides
- submit to InterProScan for functional annotation

EBI Metagenomics pipeline
- Philosophy
- Overview data analysis
- Data analysis using selected EBI and external software tools
- QC steps + tutorial
- Overview of functional analysis
- Overview of taxonomy analysis
- Result outputs
- Other public pipelines

Current outputs of EBI Metagenomics pipeline
Visualisation and download:
- QC and sequence statistics
- Diversity analysis
- Functional analysis

EBI Metagenomics pipeline: taxonomy visualisation
- Google charts dynamic representation
- switch to bar chart, column or Krona interactive views

EBI Metagenomics pipeline: functional visualisation
- Google charts dynamic representation
- InterPro matches, with links to the InterPro website
- Gene Ontology

EBI Metagenomics pipeline: download options
- Large starting material
- Small size output for post-processing

Some other Metagenomics tools http://ab.inf.uni-tuebingen.de/software/megan/ http://www.computationalbioenergy.org/software.html http://cbcb.umd.edu/software/metamos

Public Metagenomics portals http://www.ebi.ac.uk/metagenomics/ http://metagenomics.anl.gov/ http://camera.calit2.net/ http://img.jgi.doe.gov/

http://www.ebi.ac.uk/metagenomics Thanks to EMG Team, InterPro team and you for your attention