I AM NOT A METAGENOMIC EXPERT. I am merely the MESSENGER. Blaise T.F. Alako, PhD EBI Ambassador


1 I AM NOT A METAGENOMIC EXPERT I am merely the MESSENGER Blaise T.F. Alako, PhD EBI Ambassador blaise@ebi.ac.uk

2 Hubert Denise Alex Mitchell Peter Sterk Sarah Hunter

3 Blaise T.F. Alako EBI Ambassador

4 Where is the true cost of NGS? [Figure: cost breakdown chart; the percentage labels cannot be matched to their categories in this transcription. Annotated sequencing costs: ~80 bp/$ and ~2m bp/$.] Sboner et al. Genome Biology (2011) 12:125

5 EBI Metagenomics pipeline: Philosophy; Overview of data analysis; Data analysis using selected EBI and external software tools; QC steps + tutorial; Overview of functional analysis; Result outputs; Other public pipelines

6 Philosophy behind EBI Metagenomics pipeline
Helping metagenomics researchers make sense of their data. From chaos to structure:
- archiving of data with metadata
- performing stringent QC filtering prior to analysis: quality in, quality out
- performing robust taxonomy and functional analysis: model-based rather than similarity-based approaches; assignment done on reads rather than assembly
- intuitive navigation through the website
- constant drive to improvement: benchmarking and tool testing

7 EBI Metagenomics currently does not perform assembly. Why? Absence of a reference genome; short reads make chimaeras inevitable (e.g. re-analysis of Hess et al., Science (2011) 331:463). What are the consequences? Taxonomy information cannot be linked to functional annotations; viral taxonomy analysis cannot currently be performed.

8 Metagenomics data analysis Diversity analysis Quality control Functional analysis Image credits: (1) Christina Toft & Siv G. E. Andersson; (2) Dalebroux Z D et al. Microbiol. Mol. Biol. Rev. 2010;74:

9 Overview of EBI Metagenomics Pipeline
raw reads -> Quality control -> processed reads (reads that fail QC are discarded)
processed reads -> rRNASelector -> reads with rRNA -> QIIME -> Taxonomic analysis
Amplicon-based data -> QIIME -> Taxonomic analysis
processed reads -> rRNASelector -> reads without rRNA -> FragGeneScan -> predicted CDS (pCDS) -> InterProScan -> Function assignment, or unknown-function pCDS

10 EBI Metagenomics pipeline: Philosophy; Overview of data analysis; QC steps + tutorial; Overview of functional analysis; Data analysis using selected EBI and external software tools; Result outputs; Other public pipelines

11 EBI Metagenomics: QC rationale
Why? Garbage in, garbage out.
- Base call errors: each base call has an associated quality score; errors are platform-dependent
- Read quality decreases with read length
- NGS generates duplicate reads (false and real). Reducing duplication reduces analysis time and prevents analysis bias.

12 EBI Metagenomics: QC step by step
- Clipping: low-quality ends trimmed and adapter/barcode sequences removed using the Biopython SeqIO package
- Quality filtering: sequences with > 10% undetermined nucleotides removed
- Read length filtering: short sequences removed, with the cut-off depending on the platform
- Duplicate sequence removal: clustered at 99% identity (UCLUST v ) and a representative sequence chosen
- Repeat masking: RepeatMasker (open-3.2.2); reads with 50% or more nucleotides masked are removed
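
As a rough illustration of the filtering steps on this slide, here is a minimal sketch in plain Python. It is not the pipeline's own code: the function names and the `min_length` default are hypothetical (the slide only says the length cut-off is platform-dependent), and exact-match de-duplication stands in for the 99% identity UCLUST clustering. Clipping and repeat masking are left out.

```python
from typing import Iterable, List, Tuple

Read = Tuple[str, str]  # (read identifier, nucleotide sequence)


def passes_n_filter(seq: str, max_n_fraction: float = 0.10) -> bool:
    """Keep a read unless more than max_n_fraction of its bases are N."""
    if not seq:
        return False
    return seq.upper().count("N") / len(seq) <= max_n_fraction


def qc_filter(reads: Iterable[Read], min_length: int = 100) -> List[Read]:
    """Apply length, undetermined-base and duplicate filters in that order.

    min_length is an assumed placeholder; the real cut-off depends on
    the sequencing platform. Duplicates are detected by exact sequence
    match only, which is a simplification of identity-based clustering.
    """
    seen = set()
    kept: List[Read] = []
    for read_id, seq in reads:
        seq = seq.upper()
        if len(seq) < min_length:
            continue  # too short for this platform
        if not passes_n_filter(seq):
            continue  # more than 10% undetermined nucleotides
        if seq in seen:
            continue  # exact duplicate of a read already kept
        seen.add(seq)
        kept.append((read_id, seq))
    return kept
```

Calling `qc_filter(reads, min_length=10)` on a list of `(id, sequence)` tuples returns the surviving reads in input order, with the first copy of each duplicated sequence acting as its representative.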

13 EBI Metagenomics: QC consequences Roche 454 Ion Torrent Illumina

14 EBI Metagenomics: overview of functional analysis
reads without rRNA -> FragGeneScan -> predicted CDS (pCDS) -> InterProScan -> Function assignment, or unknown-function pCDS

15 EBI Metagenomics: identification of coding sequences
Prediction of coding sequences is a challenge:
- read length
- sequencing errors: frame-shifts
Two main types of approaches:
- homology-based methods: identify only known coding sequences
- feature-based approaches: predict the probability that ORFs are coding
EBI Metagenomics uses FragGeneScan:
- hidden Markov models to correct frame-shifts using codon usage
- probabilistic identification of start and stop codons
- 60 bp minimum ORF
Rho et al. (2010) NAR 38-20
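
To make the 60 bp minimum concrete, the sketch below scans all six reading frames for stop-free stretches of at least that length. This is a deliberately naive baseline and not FragGeneScan's method: it uses no hidden Markov model, no codon usage and no frame-shift correction. The function names are hypothetical, and the standard stop codons (TAA, TAG, TGA) are assumed.

```python
from typing import List, Tuple

STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code assumed
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def open_frames(seq: str, min_length: int = 60) -> List[Tuple[str, int, int, int]]:
    """List stop-free stretches of at least min_length bp in six frames.

    Each result is (strand, frame, start, end). Coordinates are 0-based,
    end-exclusive and relative to the strand that was scanned. Stretches
    may run to the end of the read, since reads are often gene fragments.
    """
    results = []
    for strand, s in (("+", seq.upper()), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = frame
            pos = frame
            while pos + 3 <= len(s):
                if s[pos:pos + 3] in STOP_CODONS:
                    if pos - start >= min_length:
                        results.append((strand, frame, start, pos))
                    start = pos + 3  # next stretch begins after the stop
                pos += 3
            # Trailing stretch that reaches the last complete codon.
            if pos - start >= min_length:
                results.append((strand, frame, start, pos))
    return results
```

On real reads this baseline typically reports candidate stretches in several frames at once, which is one reason a probabilistic model is attractive for short, error-prone reads.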

16 EBI Metagenomics: annotation of coding sequences
Most available pipelines use homology-based methods (such as BLAST):
- compare a query sequence with a database of sequences
- identify database sequences that resemble the query sequence with a homology score above a certain threshold
However, sequences may appear to have a low homology score because:
- proteins may share homology only in limited domains
- proteins from different species can differ in length
The EBI Metagenomics pipeline does not use pairwise similarity-based methods to associate functions with predicted protein sequences; instead we use InterProScan to mine the InterPro database.

17 EBI Metagenomics: Advantages of InterPro
InterPro database (HMM- and profile-based functional analysis): based on the presence of signatures (models) from several databases.
Specificity: mapping is manually curated.
- BLAST vs. UniRef100 hits: C7VBM8, Predicted protein; C7VC62, Predicted protein
- InterProScan hits: 5-formyltetrahydrofolate cyclo-ligase-like (IPR024185); Transcription regulator HTH, LysR (IPR000847)
Speed, on a test set of predicted protein sequences:
- BLAST vs. UniRef100 = 21.5 s/CDS
- InterProScan (5 databases) = 3 s/CDS

18 EBI Metagenomics: overview of taxonomy analysis
processed reads -> rRNASelector -> reads with rRNA -> QIIME -> Taxonomic analysis
Amplicon-based data -> QIIME -> Taxonomic analysis

19 EBI Metagenomics: identification of suitable sequences
Taxonomy analysis is generally based on identification and classification of rRNA sequences:
- Prokaryotes (archaebacteria and eubacteria): 5S, 16S and 23S
- Eukaryotes: 5S, 5.8S, 18S and 28S
- Viruses: there is no equivalent, so analysis depends on DNA polymerase or part of the 5'-UTR (internal ribosomal entry site [IRES]) sequences
EBI Metagenomics currently only provides taxonomy analysis for Prokaryotes.
rRNA sequences are identified using rRNASelector:
- hidden Markov models to identify rRNA sequences
- 60 bp minimum overlap with a well-curated HMM model
- E-value < 10^-5
Lee et al (2011) J Microbiol. 49(4)
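
The two thresholds on this slide (E-value below 10^-5 and at least 60 bp overlap with the model) amount to a simple filter over HMM search hits. The sketch below shows that filter only; the `HmmHit` record and `select_rrna_reads` function are hypothetical and do not reflect rRNASelector's actual output format or interface.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class HmmHit:
    """Hypothetical record for one read-versus-model hit."""
    read_id: str
    model: str       # e.g. "16S" or "23S"
    evalue: float
    overlap_bp: int  # length of the read/model overlap in base pairs


def select_rrna_reads(
    hits: Iterable[HmmHit],
    max_evalue: float = 1e-5,
    min_overlap: int = 60,
) -> List[str]:
    """Return ids of reads with at least one hit passing both thresholds."""
    selected: List[str] = []
    for hit in hits:
        passes = hit.evalue < max_evalue and hit.overlap_bp >= min_overlap
        if passes and hit.read_id not in selected:
            selected.append(hit.read_id)
    return selected
```

Reads returned by such a filter would go on to taxonomic analysis, while the remaining reads would go to coding sequence prediction.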

20 EBI Metagenomics: identification of suitable sequences
Once identified, rRNA sequences are clustered and classified using QIIME. QIIME stands for Quantitative Insights Into Microbial Ecology; it is an open-source software package for comparison and analysis of microbial communities. The main steps are:
- clustering sequences into Operational Taxonomic Units (OTUs) using uclust
- picking a representative sequence set (one sequence from each OTU)
- aligning the representative sequence set using PyNAST
- assigning taxonomy to the representative sequence set
- generating output files: filtering the alignment prior to tree building, building a phylogenetic tree, creating an OTU table
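
The first, second and last steps above (clustering into OTUs, picking representatives, building an OTU table) can be sketched with a toy greedy clustering. This is not uclust or QIIME code: the identity measure is an ungapped position-by-position comparison, the 0.97 threshold is an assumed conventional value that the slide does not state, and all function names are hypothetical.

```python
from typing import Dict, List


def identity(a: str, b: str) -> float:
    """Fraction of matching positions over the shorter sequence (ungapped)."""
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / n


def greedy_otus(seqs: Dict[str, str], threshold: float = 0.97) -> Dict[str, List[str]]:
    """Greedy clustering: map each representative id to its member ids.

    Sequences are visited longest first; a sequence joins the first OTU
    whose representative it matches at or above threshold, otherwise it
    seeds a new OTU and becomes that OTU's representative.
    """
    otus: Dict[str, List[str]] = {}
    for read_id in sorted(seqs, key=lambda r: (-len(seqs[r]), r)):
        for rep in otus:
            if identity(seqs[rep], seqs[read_id]) >= threshold:
                otus[rep].append(read_id)
                break
        else:
            otus[read_id] = [read_id]
    return otus


def otu_table(otus: Dict[str, List[str]], sample_of: Dict[str, str]) -> Dict[str, Dict[str, int]]:
    """Count reads per OTU per sample, given a read-to-sample mapping."""
    table: Dict[str, Dict[str, int]] = {}
    for rep, members in otus.items():
        counts: Dict[str, int] = {}
        for member in members:
            sample = sample_of[member]
            counts[sample] = counts.get(sample, 0) + 1
        table[rep] = counts
    return table
```

Alignment, taxonomy assignment and tree building are left out, since they depend on reference data and tools beyond what a short example can carry.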

21 EBI Metagenomics pipeline in a nutshell: a powerful and sophisticated alternative to BLAST-based functional metagenomic analysis
QC:
- trim adaptor sequences and low-quality sequence ends
- remove duplicates and short sequences
- remove low-complexity sequences
Diversity analysis:
- identify prokaryotic rRNA sequences (5S, 16S and 23S)
- cluster rRNA-containing reads
- assign taxonomy classification using QIIME
Functional analysis:
- predict ORFs
- translate ORFs into peptides
- submit to InterProScan for functional annotation

22 EBI Metagenomics pipeline: Philosophy; Overview of data analysis; Data analysis using selected EBI and external software tools; QC steps + tutorial; Overview of functional analysis; Overview of taxonomy analysis; Result outputs; Other public pipelines

23 Current outputs of EBI Metagenomics pipeline Visualisation Download - QC and sequence statistics - Diversity analysis - Functional analysis

24 EBI Metagenomics pipeline: taxonomy visualisation. Google Charts dynamic representation; switch to bar chart, column or Krona interactive views.

25 EBI Metagenomics pipeline: functional visualisation. Google Charts dynamic representation; InterPro matches with links to the InterPro website; Gene Ontology.

26 EBI Metagenomics pipeline : download options Large starting material Small size output for post-processing

27 EBI Metagenomics pipeline: Philosophy; Overview of data analysis; Data analysis using selected EBI and external software tools; QC steps + tutorial; Overview of functional analysis; Overview of taxonomy analysis; Result outputs; Other public pipelines

28 Some other Metagenomics tools

29 Public Metagenomics portals

30 Thanks to EMG Team, InterPro team and you for your attention