Computing for Metagenome Analysis

1 New Horizons of Computational Science with Heterogeneous Many-Core Processors
Computing for Metagenome Analysis
National Institute of Genetics
Hiroshi Mori & Ken Kurokawa

2 Contents
Metagenome
Sequence similarity search in metagenome
Our tools
PEZY-version of our tool

3 Metagenomics (since 1998)
Genome analysis of a microbial community to determine its member composition and functions
Workflow: Microbiota sampling -> DNA extraction from microbes -> DNA sequencing -> Bioinformatics analysis
Togo picture gallery by DBCLS is licensed under a Creative Commons Attribution 2.1 Japan license (c)

4 Data from the NHGRI Genome Sequencing Program (GSP) New Generation DNA Sequencers (NGS)

5 Metagenomic analysis of 13 human intestinal microbiomes (before NGS), Kurokawa et al.
Table columns: Sample, Sample name, Sex, Age, Total reads, Total read length (bp) Q>15
Individual: In-A (Male), In-B (Male, 6 months), In-D (Male), In-E (Male, 3 months), In-M (Female, 4 months), In-R (Female)
Family I: F1-S (Male), F1-T (Female), F1-U (Female, 7 months)
Family II: F2-V (Male), F2-W (Female), F2-X (Male), F2-Y (Female)
Total: 1,065,392 reads; 726,907,479 bp; 7 adults, 2 children, 4 infants

6 Microbial diversity based on the overall sequence similarities of genes among individuals
Multi-dimensional scaling (MDS) analysis of cumulative bitscores from reciprocal pairwise BLASTP searches (plot groups: adults/children, unweaned infants)
High similarity among adults and children
High variation among infants
A drastic change during weaning
No strong association within families
Kurokawa et al. 2007

7 HMP (US) million $ MetaHIT (EU) million

8 Arumugam et al. 2011

9 Mosca et al. 2016

10 Valencia et al. 2017

11 Mosca et al. 2016

12 Human microbiome sequence data (GB), plotted from Aug-07 to Dec-17: >120 TB, >0.27 M samples
Calculated from INSDC DRA/ERA/SRA data

13 16S rRNA gene amplicon sequencing analysis
Workflow: DNA extraction -> PCR amplification -> DNA sequencing -> Pre-analysis (remove primers, chimeras, etc.) -> Sequence clustering at the species level by CD-HIT-EST, UCLUST, etc. -> Taxonomic assignment and comparison between samples
Who's there?
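The clustering step above can be illustrated with a minimal sketch of greedy incremental clustering. This is not the algorithm of CD-HIT-EST or UCLUST, which use their own heuristics and alignment-based identity. The function names, the 0.97 default threshold, and the use of `difflib.SequenceMatcher` as a stand-in for sequence identity are all assumptions made for illustration.

```python
from difflib import SequenceMatcher
from typing import List


def identity(a: str, b: str) -> float:
    """Rough similarity in [0, 1]; a stand-in for alignment identity (assumption)."""
    # autojunk=False: otherwise difflib treats frequent letters in long
    # sequences as junk, which would discard every base of a DNA string.
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def greedy_cluster(seqs: List[str], threshold: float = 0.97) -> List[List[str]]:
    """Greedy incremental clustering (simplified sketch).

    Sequences are visited from longest to shortest. Each one joins the first
    cluster whose representative (its first member) is similar enough,
    otherwise it starts a new cluster.
    """
    clusters: List[List[str]] = []
    for seq in sorted(seqs, key=len, reverse=True):
        for cluster in clusters:
            if identity(cluster[0], seq) >= threshold:
                cluster.append(seq)
                break
        else:
            clusters.append([seq])
    return clusters


# Hypothetical toy reads, far shorter than real amplicons.
toy_reads = ["ACGTACGTACGT", "ACGTACGTACGA", "TTTTGGGGCCCC"]
toy_clusters = greedy_cluster(toy_reads, threshold=0.9)
```

Each resulting cluster would then receive one taxonomic assignment through its representative, so the number of similarity searches scales with the number of clusters rather than the number of reads.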

14 Shotgun metagenomic sequencing analysis
Workflow: DNA extraction -> DNA sequencing -> Metagenomic reads -> Assemble (MEGAHIT, IDBA-UD, etc.) -> Contig sets -> Gene finding (MGA, MetaGeneMark, etc.) -> Gene sets -> BLASTP
Comparative metagenomics (Sample1 + metadata vs. Sample2 + metadata): gene function abundance, pathway abundance, pathway reconstruction, taxonomic abundance
Who's there? What are they doing?
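As a small illustration of the comparative step, the sketch below converts per-sample lists of functional annotations into relative abundances so that samples with different sequencing depths can be compared. The function name and the labels `func_A`, `func_B`, `func_C` are hypothetical, and real pipelines typically apply further normalization (for example by gene length).

```python
from collections import Counter
from typing import Dict, List


def relative_abundance(hits: List[str]) -> Dict[str, float]:
    """Turn a list of per-gene annotations into fractions that sum to 1."""
    counts = Counter(hits)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: n / total for label, n in counts.items()}


# Hypothetical annotations for two samples (one label per predicted gene).
sample1 = ["func_A", "func_A", "func_B", "func_C"]
sample2 = ["func_A", "func_B", "func_B", "func_B"]

ab1 = relative_abundance(sample1)
ab2 = relative_abundance(sample2)

# Per-function difference in relative abundance between the two samples.
diff = {f: ab2.get(f, 0.0) - ab1.get(f, 0.0) for f in sorted(set(ab1) | set(ab2))}
```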

15 FASTA formatted file from NGS
>illumina_reads_1
CCTCAACCGGTTTCGATTACTTTTCTTTAGGGATAGCCGCCTGGATATCG
GTGCTGTTAGCGGCGGCtGCTGTCAGTTTTGAATTGGCTATCTCCGGCAC
AATCACCTTCACTAAAGTAATTACTGCCATGCTGAGTACCCATATGTTGA
TTGGCTTGGGAGAAGTTTTAATGACGGTCAGTGGTTGCTGGTTGCTGGTA
AGAACTAAAAATGTTGAGAGCCATAACTGGAATATTATTACACCACTGAT
>illumina_reads_2
GTCCTCAACAATTATTGCGCTAACGCTCAGTCCGTTTGCTTCTGGGTTTC
CAGATGGTTTGGAATGGGTAGCGGCAAAATATCAGTTTCTTCATCAGTCA
GCCCCGACTTTCGTTGCTCCACTGGCCAATTACACTGTTCCTGGGTTAAA
CAATGAGTTATTTTCTACGGGAATGGCTGGATTAATGGGAGTTCTTATTA
CTTTTGGCATGGCG
>illumina_reads_3
TGGCTCCTTGGAAGTTTAATCAATCATCCTTCCAGCGTGAGAAAGTTAGC
CTAAAAATCTCCGCTATCAATTTAAGGAATTCTTTTTTtAATTGGTTGTA
TCGTTTTTGTGATTAAAATCGGTGCGTTACGGCTATTACCTAACGCACCC
TACCACCATCAACTTAAGGAATTCTCCCTCGTTTCCAGGTTTAAAA
About 120 GB (0.5 billion sequences) per single NGS sequencing run
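To make the FASTA layout concrete, here is a minimal sketch of a streaming reader: a header line starts with `>`, and all following lines up to the next header are joined into one sequence. The function name `read_fasta` and the choice to upper-case bases are assumptions for illustration. A generator is used because a run of the size quoted above would not fit comfortably in memory as a single dictionary.

```python
from typing import Iterable, Iterator, List, Optional, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (read_id, sequence) pairs one record at a time."""
    read_id: Optional[str] = None
    chunks: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if read_id is not None:
                yield read_id, "".join(chunks)
            fields = line[1:].split()
            read_id = fields[0] if fields else ""  # ID is the first token
            chunks = []
        elif read_id is not None:
            chunks.append(line.upper())  # normalise case (a choice, not a rule)
    if read_id is not None:
        yield read_id, "".join(chunks)


# Shortened, hypothetical input in the same layout as the slide.
example_lines = [
    ">illumina_reads_1",
    "CCTCAACCGG",
    "GTGCTGTTAG",
    ">illumina_reads_2",
    "GTCCTCAACa",
]
records = list(read_fasta(example_lines))
total_bases = sum(len(seq) for _, seq in records)
```

With a real file, the same function can be fed an open file handle, for example `read_fasta(open(path))`, where `path` is whatever file the sequencer produced.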

16 Why is sequence similarity search important?
If two DNA sequences are similar:
They may share a recent common ancestor (one gene)
They may have similar properties (e.g. biological function, phylogenetic position)
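One simple way to put a number on "similar" is to compare the sets of short substrings (k-mers) that two sequences contain. The sketch below computes a k-mer Jaccard similarity. It is an alignment-free approximation chosen for brevity, not the scoring used by BLAST or by the tools discussed in these slides, and the function names and default `k` are assumptions.

```python
from typing import Set


def kmers(seq: str, k: int) -> Set[str]:
    """All distinct substrings of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def kmer_jaccard(a: str, b: str, k: int = 4) -> float:
    """Shared k-mers divided by all k-mers seen in either sequence."""
    ka, kb = kmers(a.upper(), k), kmers(b.upper(), k)
    union = ka | kb
    if not union:
        return 0.0  # both sequences shorter than k
    return len(ka & kb) / len(union)


# Hypothetical toy sequences.
same = kmer_jaccard("ACGTACGTAC", "ACGTACGTAC")
unrelated = kmer_jaccard("AAAAAAAAAA", "CCCCCCCCCC")
```

Identical sequences score 1.0 and sequences with no shared k-mer score 0.0; values in between give a rough ranking of candidates before a more careful alignment.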

17 Sequence similarity search: Query seq. -> Similarity search -> Target DB
Amplicon seq. query: Seq. num. ~= 10,000 to 100,000; Each seq. length ~= 150 to 500 bases; Total ~= 1 to 500 MB
Shotgun seq. query: Seq. num. ~= 1,000,000 to 40,000,000; Each seq. length ~= [value missing] bases; Total ~= 150 MB to 10 GB
Target DB (for amplicon): Seq. num. ~= 300,000; Each seq. len. ~= 1,500 bases; Total ~= 30 MB; All sequences have some similarity
Target DB (for shotgun): Seq. num. ~= 17,384; Each seq. len. ~= 0.5M to 10M bases; Total ~= 35 GB; A small number of sequences have similarity

18 Nature Methods Nature Biotech Bioinformatics Bioinformatics. 2014

19 Ultra-fast sequence similarity search tool for metagenomic data (DNA vs DNA) with GPGPU High accuracy comparable to BLAST Possible to analyze long sequences as queries No need for pre-indexing the target DB

20 CLAST: CUDA implemented Long read Alignment Search Tool (figure panels: Sensitivity, Speed)
Yano, Mori et al. 2014

22 VITCOMIC2 analysis workflow
Mori H et al., in press

23 Comparison of taxonomic composition inference accuracy among tools
Table columns: Software name, Spearman correlation coefficient against theoretical composition
VITCOMIC
MAPseq (Rodrigues JFM, et al. 2017, Bioinformatics)
SortMeRNA (Kopylova E, et al. 2012, Bioinformatics)
RiboTagger (Xie C, et al. 2016, BMC Bioinformatics)
Mori H et al., in press
Although we developed CLAST and VITCOMIC2,
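The accuracy metric named in this comparison, the Spearman correlation coefficient, can be sketched as the Pearson correlation of the ranks of two abundance vectors. The code below is a minimal pure-Python version with average ranks for ties. The function names and the example abundance values are hypothetical and are not taken from the benchmark.

```python
import math
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation = Pearson correlation computed on ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length vectors with at least 2 values")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return cov / math.sqrt(vx * vy)


# Hypothetical compositions: theoretical vs. inferred genus fractions.
theoretical = [0.50, 0.30, 0.20]
inferred = [0.45, 0.35, 0.20]
rho = spearman(theoretical, inferred)
```

Because only ranks matter, a tool is rewarded for ordering taxa correctly by abundance even when the inferred fractions are not exact.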

24 We want to accomplish real-time metagenome analysis
MinION™ device

25 CLAST for Heterogeneous Many-Core Processors: PZLAST
Sneak preview for PZLAST
Query (16S amplicon seq. data): Seq. num. = 157,931; Each seq. length = 151 bases; Total = 29.6 MB
Target (Bacterial 16S seq. database): Seq. num. = 31,198; Each seq. len. ~= 1,500 bases; Total = 47.3 MB
CLAST (GPGPU): Xeon 2.30 GHz, 14 cores x 2; 512 GB RAM; GPU: NVIDIA Tesla K20m (Kepler) x 1
PZLAST (Many-Core): PEZY-SC2 x 32 (1 Brick)
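A rough back-of-the-envelope calculation shows why this workload is a candidate for many-core hardware. The sketch below multiplies out the query and target sizes quoted above and estimates how many base-pair comparisons an exhaustive all-against-all dynamic-programming alignment would need. The exhaustive model is an assumption made for illustration; practical tools rely on seeding heuristics and do far less work than this upper bound.

```python
# Figures quoted on the slide.
query_count, query_len = 157_931, 151
target_count, target_len = 31_198, 1_500  # target length is approximate

# Total bases on each side (file sizes are larger because of headers).
query_bases = query_count * query_len      # about 2.4e7 bases
target_bases = target_count * target_len   # about 4.7e7 bases

# Upper bound: every query base compared with every target base.
dp_cells = query_bases * target_bases      # about 1.1e15 cells

print(f"query bases:  {query_bases:,}")
print(f"target bases: {target_bases:,}")
print(f"exhaustive DP cells: {dp_cells:.2e}")
```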

26 Our mission
Realize real-time analysis of metagenomic data by utilizing Heterogeneous Many-Core Processors.
Analyze a wide variety of samples, and take on the challenge of predicting disease risk or environmental change.

27 Acknowledgements Ken Kurokawa (NIG) Masahiro Yano (TITECH) Hitoshi Ishikawa, Yoshinori Kimura (PEZY) Toshikazu Ebisuzaki (RIKEN)