I AM NOT A METAGENOMICS EXPERT; I am merely the MESSENGER
Blaise T.F. Alako, PhD, EBI Ambassador, blaise@ebi.ac.uk
Hubert Denise Alex Mitchell Peter Sterk Sarah Hunter
http://www.ebi.ac.uk/metagenomics
Where is the true cost of NGS?
[Figure: percentage breakdown of NGS costs, comparing ~80 bp/$ with ~2m bp/$]
Sboner et al. Genome Biology (2011) 12:125
EBI Metagenomics pipeline
- Philosophy
- Overview of data analysis
- Data analysis using selected EBI and external software tools
  - QC steps + tutorial
  - Overview of functional analysis
- Result outputs
- Other public pipelines
Philosophy behind EBI Metagenomics pipeline
Helping metagenomics researchers make sense of their data
From chaos to structure:
- archiving of data with metadata
- performing stringent QC filtering prior to analysis: quality in, quality out
- performing robust taxonomy and functional analysis
  - model-based rather than similarity-based approaches
  - assignment done on reads rather than assembly
- intuitive navigation through website
- constant drive to improvement: benchmarking and tool testing
EBI Metagenomics currently does not perform assembly
Why?
- absence of reference genome
- short reads make chimaeras inevitable (e.g. re-analysis of Hess et al., Science (2011) 331:463)
What are the consequences?
- cannot link taxonomy information to functional annotations
- cannot currently perform viral taxonomy analysis
Metagenomics data analysis
- Diversity analysis
- Quality control
- Functional analysis
Image credits: (1) Christina Toft & Siv G. E. Andersson; (2) Dalebroux Z D et al. Microbiol. Mol. Biol. Rev. 2010;74:171-199
Overview of EBI Metagenomics Pipeline
raw reads -> Quality control -> processed reads (reads that fail QC are discarded)
processed reads -> rRNASelector
- reads with rRNA -> QIIME -> Taxonomic analysis
- reads without rRNA -> FragGeneScan -> predicted CDS (pCDS) -> InterProScan -> Function assignment, or pCDS with unknown function
Amplicon-based data -> QIIME
EBI Metagenomics pipeline
- Philosophy
- Overview of data analysis
- Data analysis using selected EBI and external software tools
  - QC steps + tutorial
  - Overview of functional analysis
- Result outputs
- Other public pipelines
EBI Metagenomics: QC rationale
Why? Garbage in, garbage out
- Base call errors:
  - each base call has an associated quality score
  - errors are platform-dependent
- Read quality decreases with read length
- NGS generates duplicate reads (false and real). Reducing duplication reduces analysis time and prevents analysis bias.
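To make the per-base quality score concrete, here is a minimal sketch. It assumes Phred-scaled scores stored as ASCII characters with an offset of 33; the slides do not say which encoding each platform uses, and the function names are illustrative only.

```python
def decode_quality(quality_string, offset=33):
    """Convert an ASCII quality string to integer scores (offset 33 is an assumption)."""
    return [ord(char) - offset for char in quality_string]


def phred_to_error_probability(score):
    """Phred definition: Q = -10 * log10(P), so P = 10 ** (-Q / 10)."""
    return 10 ** (-score / 10)


scores = decode_quality("II5+")
error_probabilities = [phred_to_error_probability(q) for q in scores]
```

A score of 20 corresponds to a 1 in 100 chance that the base call is wrong, and a score of 40 to 1 in 10,000.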
EBI Metagenomics: QC step by step
- Clipping: low-quality ends trimmed and adapter/barcode sequences removed using the Biopython SeqIO package
- Quality filtering: sequences with > 10% undetermined nucleotides removed
- Read length filtering: short sequences removed, with the threshold depending on the platform
- Duplicate sequence removal: clustered at 99% identity (UCLUST v 1.1.579) and a representative sequence chosen
- Repeat masking: RepeatMasker (open-3.2.2); reads with 50% or more nucleotides masked are removed
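Three of these filters can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the pipeline's code: the 100 bp minimum length is a hypothetical value (the real threshold is platform-dependent), and exact-duplicate removal stands in for the 99% identity UCLUST clustering. Clipping and repeat masking are left out.

```python
def undetermined_fraction(sequence):
    """Fraction of bases that are not A, C, G or T (for example N)."""
    if not sequence:
        return 1.0
    return sum(base not in "ACGT" for base in sequence.upper()) / len(sequence)


def qc_filter(reads, min_length=100, max_undetermined=0.10):
    """Keep reads that pass the quality, length and duplicate filters.

    min_length is a hypothetical threshold. Exact-match deduplication is a
    simplification of clustering at 99% identity.
    """
    seen = set()
    kept = []
    for sequence in reads:
        sequence = sequence.upper()
        if undetermined_fraction(sequence) > max_undetermined:
            continue  # more than 10% undetermined nucleotides
        if len(sequence) < min_length:
            continue  # too short
        if sequence in seen:
            continue  # exact duplicate of a read already kept
        seen.add(sequence)
        kept.append(sequence)
    return kept


reads = ["ACGT" * 30, "ACGT" * 30, "ACGTNNNN" * 15, "ACGT" * 5]
passed = qc_filter(reads)
```

Of the four example reads, only the first survives: the second is a duplicate, the third is 50% N, and the fourth is shorter than the assumed minimum.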
EBI Metagenomics: QC consequences
[Figure panels: Roche 454, Ion Torrent, Illumina]
EBI Metagenomics: overview of functional analysis
reads without rRNA -> FragGeneScan -> predicted CDS (pCDS) -> InterProScan -> Function assignment, or pCDS with unknown function
EBI Metagenomics: identification of coding sequences
Prediction of coding sequences is a challenge:
- read length
- sequencing errors: frame-shifts
Two main types of approaches:
- homology-based methods: identify only known coding sequences
- feature-based approaches: predict the probability that ORFs are coding
EBI Metagenomics uses FragGeneScan:
- hidden Markov models to correct frame-shifts using codon usage
- probabilistic identification of start and stop codons
- 60 bp minimum ORF
Rho et al. (2010) NAR 38(20)
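To show what the 60 bp minimum ORF means in practice, the sketch below scans the three forward frames of a read for stop-free stretches of at least 60 bp. It is a deliberately naive stand-in: FragGeneScan itself uses hidden Markov models and models frame-shifts, neither of which this code attempts. The reverse strand is ignored, and the function name is illustrative.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def open_reading_frames(sequence, min_length=60):
    """Return (frame, start, end) for stop-free stretches of >= min_length bp.

    Forward strand only; end is exclusive and excludes the stop codon.
    """
    sequence = sequence.upper()
    orfs = []
    for frame in range(3):
        start = frame
        position = frame
        while position + 3 <= len(sequence):
            if sequence[position:position + 3] in STOP_CODONS:
                if position - start >= min_length:
                    orfs.append((frame, start, position))
                start = position + 3  # next stretch begins after the stop
            position += 3
        # Stretch that runs to the end of the read without meeting a stop
        if position - start >= min_length:
            orfs.append((frame, start, position))
    return orfs


read = "GCT" * 25 + "TAA" + "GCT" * 5
orfs = open_reading_frames(read)
```

Because reads are short, a stretch that runs off the end of the read is reported too; a read rarely contains a complete gene.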
EBI Metagenomics: annotation of coding sequences
Most available pipelines use homology-based methods (such as BLAST):
- compare a query sequence with a database of sequences
- identify database sequences that resemble the query sequence with a homology score above a certain threshold
However, sequences may appear to have a low homology score because:
- proteins may share homology only in limited domains
- proteins from different species can differ in length
The EBI Metagenomics pipeline does not use pairwise similarity-based methods to associate functions with predicted protein sequences; instead we use InterProScan to mine the InterPro database
EBI Metagenomics: Advantage of InterPro
InterPro database (HMM and profile based functional analysis): based on presence of signatures (models) from several databases
Specificity: mapping is manually curated

BLAST vs. UniRef100 hit      InterProScan hit
C7VBM8, Predicted protein    5-formyltetrahydrofolate cyclo-ligase-like (IPR024185)
C7VC62, Predicted protein    Transcription regulator HTH, LysR (IPR000847)

Speed: test set of 40692 predicted protein sequences
- BLAST vs UniRef100 = 21.5 s/cds
- InterProScan (5 databases) = 3 s/cds
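The per-sequence timings translate into total run time as follows. This is simple arithmetic on the figures quoted on the slide, and it assumes the sequences are processed one after another on a single worker, which the slide does not state.

```python
N_SEQUENCES = 40692          # test set size from the slide
BLAST_SECONDS_PER_CDS = 21.5
INTERPROSCAN_SECONDS_PER_CDS = 3.0

blast_total_seconds = N_SEQUENCES * BLAST_SECONDS_PER_CDS                # 874878 s
interproscan_total_seconds = N_SEQUENCES * INTERPROSCAN_SECONDS_PER_CDS  # 122076 s

blast_hours = blast_total_seconds / 3600                 # about 243 hours
interproscan_hours = interproscan_total_seconds / 3600   # about 34 hours
speedup = BLAST_SECONDS_PER_CDS / INTERPROSCAN_SECONDS_PER_CDS  # about 7x
```

On this test set that is roughly ten days of sequential BLAST against under a day and a half of InterProScan.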
EBI Metagenomics: overview of taxonomy analysis
processed reads -> rRNASelector -> reads with rRNA -> QIIME -> Taxonomic analysis
Amplicon-based data -> QIIME
EBI Metagenomics: identification of suitable sequences
Taxonomy analysis is generally based on identification and classification of rRNA sequences
- Prokaryotes (archaebacteria and eubacteria): 5S, 16S and 23S
- Eukaryotes: 5S, 5.8S, 18S and 28S
- Viruses: there is no equivalent, so analysis depends on DNA polymerase or part of 5'-UTR (internal ribosomal entry site [IRES]) sequences
EBI Metagenomics currently only provides taxonomy analysis for Prokaryotes.
rRNA sequences are identified using rRNASelector:
- hidden Markov models to identify rRNA sequences
- 60 bp minimum overlap with well-curated HMM model
- E-value < 10^-5
Lee et al. (2011) J Microbiol. 49(4)
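The two rRNASelector thresholds amount to a simple filter over HMM hits. The sketch below applies them to hypothetical hit records; the dictionary layout and read names are invented for illustration and do not reflect rRNASelector's actual output format.

```python
def is_rrna_hit(overlap_bp, evalue, min_overlap=60, max_evalue=1e-5):
    """True when a hit meets both thresholds from the slide."""
    return overlap_bp >= min_overlap and evalue < max_evalue


def split_reads(hits):
    """Split read IDs into (with rRNA, without rRNA).

    hits maps a read ID to a dict with 'overlap_bp' and 'evalue' keys,
    or to None when the read had no HMM hit (hypothetical layout).
    """
    with_rrna, without_rrna = [], []
    for read_id, hit in hits.items():
        if hit is not None and is_rrna_hit(hit["overlap_bp"], hit["evalue"]):
            with_rrna.append(read_id)
        else:
            without_rrna.append(read_id)
    return with_rrna, without_rrna


example_hits = {
    "read_1": {"overlap_bp": 120, "evalue": 1e-30},  # passes both thresholds
    "read_2": {"overlap_bp": 45, "evalue": 1e-12},   # overlap too short
    "read_3": {"overlap_bp": 90, "evalue": 1e-3},    # E-value too high
    "read_4": None,                                  # no hit at all
}
with_rrna, without_rrna = split_reads(example_hits)
```

Reads in the first list would go on to taxonomic analysis and those in the second to coding sequence prediction.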
EBI Metagenomics: identification of suitable sequences
Once identified, rRNA sequences are clustered and classified using QIIME
QIIME stands for Quantitative Insights Into Microbial Ecology. QIIME is an open source software package for comparison and analysis of microbial communities
The main steps are:
- clustering sequences into Operational Taxonomic Units (OTUs) using uclust
- picking a representative sequence set (one sequence from each OTU)
- aligning the representative sequence set using PyNAST
- assigning taxonomy to the representative sequence set
- generating output files:
  - filtering the alignment prior to tree building
  - building phylogenetic tree
  - creating OTU table
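The first two steps and the final OTU table can be illustrated with a toy greedy clustering. This is not how uclust or QIIME are implemented: identity here is a position-by-position comparison of equal-length sequences rather than an alignment, and the 97% threshold is an assumption, since the slide does not give the OTU cut-off.

```python
from collections import Counter


def identity(seq_a, seq_b):
    """Fraction of matching positions; a crude stand-in for alignment identity."""
    if len(seq_a) != len(seq_b) or not seq_a:
        return 0.0
    return sum(a == b for a, b in zip(seq_a, seq_b)) / len(seq_a)


def greedy_otus(sequences, threshold=0.97):
    """Assign each sequence to the first centroid it matches, else start a new OTU.

    Returns (representatives, assignment): one representative sequence per OTU
    and the OTU index of every input sequence. threshold is an assumed value.
    """
    representatives = []
    assignment = []
    for sequence in sequences:
        for index, centroid in enumerate(representatives):
            if identity(sequence, centroid) >= threshold:
                assignment.append(index)
                break
        else:
            representatives.append(sequence)
            assignment.append(len(representatives) - 1)
    return representatives, assignment


sequences = ["A" * 100, "A" * 99 + "C", "C" * 100]
representatives, assignment = greedy_otus(sequences)
otu_table = Counter(assignment)  # OTU index -> number of reads
```

The first sequence seen in each cluster serves as its representative, which mirrors the idea of picking one sequence per OTU before alignment and taxonomy assignment.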
EBI Metagenomics pipeline in a nutshell
Powerful and sophisticated alternative to BLAST-based functional metagenomic analysis
QC:
- trim adaptor sequences and low-quality sequence ends
- remove duplicates and short sequences
- remove low-complexity sequences
Diversity analysis:
- identify prokaryotic rRNA sequences (5S, 16S and 23S)
- cluster rRNA-containing reads
- assign taxonomy classification using QIIME
Functional analysis:
- predict ORFs
- translate ORFs into peptides
- submit to InterProScan for functional annotation
EBI Metagenomics pipeline
- Philosophy
- Overview of data analysis
- Data analysis using selected EBI and external software tools
  - QC steps + tutorial
  - Overview of functional analysis
  - Overview of taxonomy analysis
- Result outputs
- Other public pipelines
Current outputs of EBI Metagenomics pipeline
Visualisation / Download
- QC and sequence statistics
- Diversity analysis
- Functional analysis
EBI Metagenomics pipeline: taxonomy visualisation
- Google Charts dynamic representation
- switch to bar chart, column or Krona interactive views
EBI Metagenomics pipeline: functional visualisation
- Google Charts dynamic representation
- InterPro matches: links to InterPro website
- Gene Ontology
EBI Metagenomics pipeline: download options
- Large starting material
- Small size output for post-processing
EBI Metagenomics pipeline
- Philosophy
- Overview of data analysis
- Data analysis using selected EBI and external software tools
  - QC steps + tutorial
  - Overview of functional analysis
  - Overview of taxonomy analysis
- Result outputs
- Other public pipelines
Some other metagenomics tools
- http://ab.inf.uni-tuebingen.de/software/megan/
- http://www.computationalbioenergy.org/software.html
- http://cbcb.umd.edu/software/metamos
Public metagenomics portals
- http://www.ebi.ac.uk/metagenomics/
- http://metagenomics.anl.gov/
- http://camera.calit2.net/
- http://img.jgi.doe.gov/
http://www.ebi.ac.uk/metagenomics
Thanks to the EMG team and the InterPro team, and to you for your attention