Computing for Metagenome Analysis

New Horizons of Computational Science with Heterogeneous Many-Core Processors. Computing for Metagenome Analysis. National Institute of Genetics. Hiroshi Mori & Ken Kurokawa

Contents: Metagenome; Sequence similarity search in metagenome; Our tools; PEZY-version of our tool

Metagenomics (since 1998): genome analysis of a microbial community to determine its member composition and functions. Workflow: Microbiota sampling -> DNA extraction from microbes -> DNA sequencing -> Bioinformatics analysis. Togo picture gallery by DBCLS is licensed under a Creative Commons Attribution 2.1 Japan license (c)

New Generation DNA Sequencers (NGS). Data from the NHGRI Genome Sequencing Program (GSP): https://www.genome.gov/27541954/dna-sequencing-costs-data/

Metagenomic analysis of 13 human intestinal microbiomes (before NGS), Kurokawa et al. 2007
Sample      Sample name  Sex     Age (years unless noted)  Total reads  Total read length (bp), Q>15
Individual  In-A         Male    45                        81687        52509363
Individual  In-B         Male    6 months                  80617        62792581
Individual  In-D         Male    35                        84237        55137918
Individual  In-E         Male    3 months                  80852        56781600
Individual  In-M         Female  4 months                  89340        57808421
Individual  In-R         Female  24                        85787        55404826
Family I    F1-S         Male    30                        78452        53568019
Family I    F1-T         Female  28                        81348        55365235
Family I    F1-U         Female  7 months                  82525        53864663
Family II   F2-V         Male    37                        80772        55926002
Family II   F2-W         Female  36                        79163        54885684
Family II   F2-X         Male    3                         80858        56587120
Family II   F2-Y         Female  1.5                       79754        56276047
Total: 1,065,392 reads; 726,907,479 bp (7 adults, 2 children, 4 infants)

Microbial diversity based on the overall sequence similarities of genes among individuals: multi-dimensional scaling (MDS) analysis of cumulative bitscore from reciprocal pairwise BLASTP searches. High similarity among adults and children. High variation among infants. A drastic change during weaning. No strong association within families. [Figure labels: Adults/children; Unweaned infants] Kurokawa et al. 2007
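As an illustration of the MDS step named on this slide, here is a minimal sketch of classical (Torgerson) MDS in plain Python. It is not the procedure of Kurokawa et al. 2007: the toy score matrix, the conversion from reciprocal cumulative bitscores to a distance, and the function name classical_mds are all assumptions made for this example, and the eigenvectors are found by simple power iteration, which suits small, near-Euclidean distance matrices.

```python
import math
import random

def classical_mds(dist, dims=2, iters=500, seed=0):
    """Embed n items in `dims` dimensions from a symmetric distance matrix."""
    n = len(dist)
    sq = [[dist[i][j] ** 2 for j in range(n)] for i in range(n)]
    row_mean = [sum(row) / n for row in sq]
    grand = sum(row_mean) / n
    # Double-centre the squared distances: B = -1/2 * J * D^2 * J
    b = [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + grand)
          for j in range(n)] for i in range(n)]
    rng = random.Random(seed)
    coords = [[0.0] * dims for _ in range(n)]
    for d in range(dims):
        # Power iteration for the dominant eigenvector of the current B
        v = [rng.uniform(-1.0, 1.0) for _ in range(n)]
        for _ in range(iters):
            w = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:  # nothing left to explain
                break
            v = [x / norm for x in w]
        vv = sum(x * x for x in v)
        # Rayleigh quotient gives the eigenvalue for this direction
        eigval = sum(v[i] * b[i][j] * v[j]
                     for i in range(n) for j in range(n)) / vv
        # Negative eigenvalues (non-Euclidean input) are clipped to zero
        scale = math.sqrt(max(eigval, 0.0)) / math.sqrt(vv)
        for i in range(n):
            coords[i][d] = scale * v[i]
        # Deflate so the next pass finds the next eigenvector
        for i in range(n):
            for j in range(n):
                b[i][j] -= eigval * v[i] * v[j] / vv
    return coords

# Hypothetical cumulative bitscores: scores[i][j] = sample i searched against sample j
scores = [[100.0, 80.0, 30.0],
          [78.0, 100.0, 28.0],
          [32.0, 30.0, 100.0]]
n = len(scores)
# Assumed conversion: symmetrise reciprocal scores and normalise by self-scores
dist = [[0.0 if i == j else
         1.0 - (scores[i][j] + scores[j][i]) / (scores[i][i] + scores[j][j])
         for j in range(n)] for i in range(n)]
for name, xy in zip(["sample_1", "sample_2", "sample_3"], classical_mds(dist)):
    print(name, [round(c, 3) for c in xy])
```

Samples with high reciprocal similarity end up close together in the resulting 2-D plot, which is the kind of picture the slide describes for adults versus infants.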

HMP (US), 2008-2012, 140 million USD; MetaHIT (EU), 2008-2011, 11.4 million EUR

Arumugam et al. 2011

Mosca et al. 2016

Valencia et al. 2017

Mosca et al. 2016

[Figure: Human microbiome sequence data (GB) by month, Aug-07 to Dec-17] >120 TB, >0.27 M samples. Calculated from INSDC DRA/ERA/SRA data

16S rRNA gene amplicon sequencing analysis (Who's there?). Workflow: DNA extraction -> PCR amplification -> DNA sequencing -> Pre-analysis (remove primers, chimeras, etc.) -> Sequence clustering at species level by CD-HIT-EST, UCLUST, etc. -> Taxonomic assignment and comparison between samples

Shotgun metagenomic sequencing analysis (Who's there? What are they doing?). Workflow: DNA extraction -> DNA sequencing -> Metagenomic reads -> Assembly (MEGAHIT, IDBA-UD, etc.) -> Contig sets -> Gene finding (MGA, MetaGeneMark, etc.) -> Gene sets -> BLASTP. Outputs: gene function abundance, pathway abundance, pathway reconstruction, taxonomic abundance. Comparative metagenomics: Sample1 with metadata vs. Sample2 with metadata

FASTA formatted file from NGS
>illumina_reads_1
CCTCAACCGGTTTCGATTACTTTTCTTTAGGGATAGCCGCCTGGATATCG
GTGCTGTTAGCGGCGGCtGCTGTCAGTTTTGAATTGGCTATCTCCGGCAC
AATCACCTTCACTAAAGTAATTACTGCCATGCTGAGTACCCATATGTTGA
TTGGCTTGGGAGAAGTTTTAATGACGGTCAGTGGTTGCTGGTTGCTGGTA
AGAACTAAAAATGTTGAGAGCCATAACTGGAATATTATTACACCACTGAT
>illumina_reads_2
GTCCTCAACAATTATTGCGCTAACGCTCAGTCCGTTTGCTTCTGGGTTTC
CAGATGGTTTGGAATGGGTAGCGGCAAAATATCAGTTTCTTCATCAGTCA
GCCCCGACTTTCGTTGCTCCACTGGCCAATTACACTGTTCCTGGGTTAAA
CAATGAGTTATTTTCTACGGGAATGGCTGGATTAATGGGAGTTCTTATTA
CTTTTGGCATGGCG
>illumina_reads_3
TGGCTCCTTGGAAGTTTAATCAATCATCCTTCCAGCGTGAGAAAGTTAGC
CTAAAAATCTCCGCTATCAATTTAAGGAATTCTTTTTTtAATTGGTTGTA
TCGTTTTTGTGATTAAAATCGGTGCGTTACGGCTATTACCTAACGCACCC
TACCACCATCAACTTAAGGAATTCTCCCTCGTTTCCAGGTTTAAAA
About 120 GB (0.5 billion sequences) per single NGS sequencing run
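To make the FASTA layout on this slide concrete, the following is a minimal sketch of a streaming reader that counts sequences and bases. The function name read_fasta and the shortened toy records are assumptions for illustration; it is not part of CLAST, VITCOMIC2 or any other tool named in the talk.

```python
import io

def read_fasta(handle):
    """Yield (name, sequence) pairs from a FASTA-formatted text stream."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any
            if name is not None:
                yield name, "".join(chunks)
            parts = line[1:].split()
            name, chunks = (parts[0] if parts else ""), []
        elif name is not None:
            # Sequence lines may be wrapped; upper-case to normalise 't' vs 'T'
            chunks.append(line.upper())
    if name is not None:
        yield name, "".join(chunks)

# Toy input in the same layout as the slide (sequences shortened)
toy = io.StringIO(
    ">illumina_reads_1\n"
    "CCTCAACCGG\n"
    "TTTCGAtTAC\n"
    ">illumina_reads_2\n"
    "GTCCTCAACA\n"
)
records = list(read_fasta(toy))
total_bases = sum(len(seq) for _, seq in records)
print(len(records), "sequences,", total_bases, "bases")
```

Because the reader is a generator, it holds only one record in memory at a time, which matters at the roughly 0.5 billion sequences per run mentioned on the slide.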

Why is sequence similarity search important? If two DNA sequences are similar, they may share a recent common ancestor (one gene), and they may have similar properties (e.g. biological function, phylogenetic position).
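One textbook way to quantify how similar two DNA sequences are is the Smith-Waterman local alignment score. The sketch below is only an illustration of that idea: BLAST-like tools rely on faster heuristics rather than this full dynamic programme, and the scoring values (match=2, mismatch=-1, gap=-2) and the function name are assumptions chosen for the example.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between sequences a and b."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Local alignment: a cell never drops below zero
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best

# Two toy reads sharing the core "ACGT" score higher than unrelated reads
print(smith_waterman_score("TTACGTTT", "GGACGTGG"))
print(smith_waterman_score("AAAA", "TTTT"))
```

The cost of this approach grows with the product of the two sequence lengths, which is why heuristic and hardware-accelerated search tools are needed at metagenomic scale.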

Sequence similarity search: Query seq. -> Similarity search -> Target DB
Amplicon seq. query: Seq. num. ~= 10,000-100,000; Each seq. length ~= 150-500 bases; Total ~= 1-500 MB
Shotgun seq. query: Seq. num. ~= 1,000,000-40,000,000; Each seq. length ~= 150-300 bases; Total ~= 150 MB-10 GB
Target DB (for amplicon): Seq. num. ~= 300,000; Each seq. len. ~= 1,500 bases; Total ~= 30 MB; All sequences have some similarity
Target DB (for shotgun): Seq. num. ~= 17,384; Each seq. len. ~= 0.5M-10M bases; Total ~= 35 GB; A small number of sequences have similarity
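A rough back-of-the-envelope sketch shows why these sizes call for fast search tools. It takes the upper ends of the query ranges and the target totals from the slide, assumes one byte per base and decimal units (1 GB = 10^9 bytes), and counts the cells an exhaustive dynamic-programming comparison would fill. The throughput figure CELLS_PER_SECOND is purely hypothetical and is not a measurement of any tool or processor.

```python
def alignment_cells(query_count, query_len, target_bases):
    """Cells in an exhaustive query-by-target dynamic-programming comparison."""
    return query_count * query_len * target_bases

MB = 10 ** 6
GB = 10 ** 9

# Upper ends of the ranges on the slide; targets taken as the stated totals
amplicon_cells = alignment_cells(100_000, 500, 30 * MB)
shotgun_cells = alignment_cells(40_000_000, 300, 35 * GB)

CELLS_PER_SECOND = 10 ** 12  # hypothetical throughput, for scale only
SECONDS_PER_YEAR = 365 * 24 * 3600

print("amplicon:", amplicon_cells, "cells,",
      amplicon_cells / CELLS_PER_SECOND, "s")
print("shotgun: ", shotgun_cells, "cells,",
      round(shotgun_cells / CELLS_PER_SECOND / SECONDS_PER_YEAR, 1), "years")
```

Under these assumptions the shotgun case is several orders of magnitude larger than the amplicon case, which motivates both heuristic search and accelerator hardware.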

Nature Methods, 2015; Nature Biotech., 2017; Bioinformatics, 2017; Bioinformatics, 2014

Ultra-fast sequence similarity search tool for metagenomic data (DNA vs DNA) with GPGPU. High accuracy comparable to BLAST. Possible to analyze long sequences as queries. No need for pre-indexing the target DB.

CLAST: CUDA implemented Long read Alignment Search Tool. [Figure panels: Sensitivity; Speed] Yano, Mori et al. 2014

http://vitcomic.org

VITCOMIC2 analysis workflow (Mori H et al., in press)

Comparison of taxonomic composition inference accuracy among tools (Mori H et al., in press)
Software name                                         Spearman correlation coefficient against theoretical composition
VITCOMIC2                                             0.654
MAPseq (Rodrigues JFM, et al. 2017, Bioinformatics)   0.582
SortMeRNA (Kopylova E, et al. 2012, Bioinformatics)   0.629
RiboTagger (Xie C, et al. 2016, BMC Bioinformatics)   0.588

Although we developed CLAST and VITCOMIC2, we want to accomplish real-time metagenome analysis. MinION(TM) device: https://nanoporetech.com/
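The tool comparison above is scored with a Spearman correlation coefficient against the theoretical composition. As a minimal sketch of how such a coefficient can be computed, the code below ranks both vectors (ties get average ranks) and takes the Pearson correlation of the ranks. The abundance values are invented toy numbers, not data from the VITCOMIC2 evaluation, and the function names are chosen for this example only.

```python
import math

def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation; undefined (division by zero) for constant input."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)

def spearman(x, y):
    """Spearman rank correlation between two equal-length vectors."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical relative abundances for five taxa
theoretical = [0.40, 0.25, 0.20, 0.10, 0.05]
inferred = [0.35, 0.30, 0.15, 0.12, 0.08]
print(round(spearman(theoretical, inferred), 3))
```

Because it depends only on ranks, the coefficient rewards getting the order of taxa right even when the inferred abundances are off in absolute terms.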

CLAST for Heterogeneous Many-Core Processors: PZLAST. Sneak preview for PZLAST
Query (16S amplicon seq. data): Seq. num. = 157,931; Each seq. length = 151 bases; Total = 29.6 MB
Target (Bacterial 16S seq. database): Seq. num. = 31,198; Each seq. len. ~= 1,500 bases; Total = 47.3 MB
CLAST (GPGPU): Xeon 2.30 GHz 14 core x 2; 512 GB RAM; GPU NVIDIA Tesla K20m (Kepler) x 1
PZLAST (Many-Core): PEZY-SC2 x 32 (1 Brick)

Our mission: Realize real-time analysis of metagenomic data by utilizing Heterogeneous Many-Core Processors. Analyze a wide variety of samples, and take on the challenge of predicting disease risk or environmental change.

Acknowledgements: Ken Kurokawa (NIG); Masahiro Yano (TITECH); Hitoshi Ishikawa, Yoshinori Kimura (PEZY); Toshikazu Ebisuzaki (RIKEN)