Accelerate High Throughput Analysis for Genome Sequencing with GPU

Similar documents
GPU Technology Conference 2012 May 14-17, 2012 San Jose, California BingQiang WANG, Head of Scalable Computing, BGI

Next Generation Bioinformatics on the Cloud

The Expanded Illumina Sequencing Portfolio New Sample Prep Solutions and Workflow

Crash-course in genomics

LARGE DATA AND BIOMEDICAL COMPUTATIONAL PIPELINES FOR COMPLEX DISEASES

Basics of RNA-Seq. (With a Focus on Application to Single Cell RNA-Seq) Michael Kelly, PhD Team Lead, NCI Single Cell Analysis Facility

Illumina (Solexa) Throughput: 4 Tbp in one run (5 days) Cheapest sequencing technology. Mismatch errors dominate. Cost: ~$1000 per human genme

Genome 373: Mapping Short Sequence Reads II. Doug Fowler

Variation detection based on second generation sequencing data. Xin LIU Department of Science and Technology, BGI

NEXT GENERATION SEQUENCING. Farhat Habib

Accelerate precision medicine with Microsoft Genomics

IMPORTANT: Visit for the most up-to-date schedule.

DNBseq TM SERVICE OVERVIEW Plant and Animal Whole Genome Re-Sequencing

CUMACH - A Fast GPU-based Genotype Imputation Tool. Agatha Hu

Why can GBS be complicated? Tools for filtering, error correction and imputation.

The Sentieon Genomic Tools Improved Best Practices Pipelines for Analysis of Germline and Tumor-Normal Samples

The New Genome Analyzer IIx Delivering more data, faster, and easier than ever before. Jeremy Preston, PhD Marketing Manager, Sequencing

Introduction. CS482/682 Computational Techniques in Biological Sequence Analysis

Introduction to Bioinformatics and Gene Expression Technologies

Introduction to Bioinformatics and Gene Expression Technologies

Bioinformatics Tools. Stuart M. Brown, Ph.D Dept of Cell Biology NYU School of Medicine

AGILENT S BIOINFORMATICS ANALYSIS SOFTWARE

Sequencing technologies. Jose Blanca COMAV institute bioinf.comav.upv.es

Why can GBS be complicated? Tools for filtering & error correction. Edward Buckler USDA-ARS Cornell University

Targeted Sequencing in the NBS Laboratory

The Computational Impact of Genomics on Biotechnology R&D (sort of )

Next Generation Sequencing. Tobias Österlund

Bioinformatics and computational tools

DNA METHYLATION RESEARCH TOOLS

Introduction to 'Omics and Bioinformatics

NGS in Pathology Webinar

HLA and Next Generation Sequencing it s all about the Data

Introduction to Bioinformatics

Novel HPC technologies for Rapid Analysis in Bioinformatics Presenter: Paul Walsh, nsilico Life Science Ltd, Ireland

Whole Human Genome Sequencing Report This is a technical summary report for PG DNA

This place covers: Methods or systems for genetic or protein-related data processing in computational molecular biology.

Genomics. Data Analysis & Visualization. Camilo Valdes

Processing Data from Next Generation Sequencing

SEQUENCING. M Ataei, PhD. Feb 2016

Trans-omics for a better life

Genomic Technologies. Michael Schatz. Feb 1, 2018 Lecture 2: Applied Comparative Genomics

Nanoparticle risk assessment. practical solutions. systems toxicology

A Crash Course in NGS for GI Pathologists. Sandra O Toole

ChIP-seq and RNA-seq. Farhat Habib

latestdevelopments relevant for the Ag sector André Eggen Agriculture Segment Manager, Europe

Rapid Parallel Genome Indexing using MapReduce

Benno Pütz. MPI of Psychiatry

SNPs - GWAS - eqtls. Sebastian Schmeier

Sequencing technologies. Jose Blanca COMAV institute bioinf.comav.upv.es

Lecture 2: Central Dogma of Molecular Biology & Intro to Programming

High-Throughput Bioinformatics: Re-sequencing and de novo assembly. Elena Czeizler

Sequencing technologies. Jose Blanca COMAV institute bioinf.comav.upv.es

Assembling a Cassava Transcriptome using Galaxy on a High Performance Computing Cluster

Bioinformatics and Life Sciences Standards and Programming for Heterogeneous Architectures

About Strand NGS. Strand Genomics, Inc All rights reserved.

Accelerating Genomic Computations 1000X with Hardware

High peformance computing infrastructure for bioinformatics

From Variants to Pathways: Agilent GeneSpring GX s Variant Analysis Workflow

NGS Approaches to Epigenomics

Computational Challenges in Life Sciences Research Infrastructures

UAB DNA-Seq Analysis Workshop. John Osborne Research Associate Centers for Clinical and Translational Science

Evolutionary Genetics: Part 1 Polymorphism in DNA


BIOINFORMATICS AND SYSTEM BIOLOGY (INTERNATIONAL PROGRAM)

ChIP-seq and RNA-seq

Next Generation Sequencing: An Overview

Comparing a few SNP calling algorithms using low-coverage sequencing data

Theory and Application of Multiple Sequence Alignments

Satellite Education Workshop (SW4): Epigenomics: Design, Implementation and Analysis for RNA-seq and Methyl-seq Experiments

Machine Learning. HMM applications in computational biology

The Future of IntegrOmics

Computational Challenges of Medical Genomics

Experimental Design Microbial Sequencing

GBS Usage Cases: Non-model Organisms. Katie E. Hyma, PhD Bioinformatics Core Institute for Genomic Diversity Cornell University

HaloPlex HS. Get to Know Your DNA. Every Single Fragment. Kevin Poon, Ph.D.

Introduction to RNA-Seq in GeneSpring NGS Software

'Bioinformatics in academia as related to ehealth' - including the "Genomic Virtual Lab"

NextSeq 500 System WGS Solution

Whole Transcriptome Analysis of Illumina RNA- Seq Data. Ryan Peters Field Application Specialist

Ultrasequencing: Methods and Applications of the New Generation Sequencing Platforms

Compute- and Data-Intensive Analyses in Bioinformatics"

INTRODUCTION TO MOLECULAR GENETICS. Andrew McQuillin Molecular Psychiatry Laboratory UCL Division of Psychiatry 22 Sept 2017

Introduction to Genome Wide Association Studies 2014 Sydney Brenner Institute for Molecular Bioscience/Wits Bioinformatics Shaun Aron

Following text taken from Suresh Kumar. Bioinformatics Web - Comprehensive educational resource on Bioinformatics. 6th May.2005

Deep learning sequence-based ab initio prediction of variant effects on expression and disease risk

Peter Ungaro President and CEO

E2ES to Accelerate Next-Generation Genome Analysis in Clinical Research

Genomic Data Is Going Google. Ask Bigger Biological Questions

C3BI. VARIANTS CALLING November Pierre Lechat Stéphane Descorps-Declère

Data Basics. Josef K Vogt Slides by: Simon Rasmussen Next Generation Sequencing Analysis

Updated software SIFT 4G can further research in human health, the study of biological processes and agricultural products

H3A - Genome-Wide Association testing SOP

Overcome limitations with RNA-Seq

2/19/13. Contents. Applications of HMMs in Epigenomics

BIOINFORMATICS TO ANALYZE AND COMPARE GENOMES

Bioinformatics Advice on Experimental Design

Application of Genotyping-By-Sequencing and Genome-Wide Association Analysis in Tetraploid Potato

Users? Data. Implications for HPC in Biology Dr Thomas R Connor School of Biosciences, Cardiff University

The MiniSeq System. Explore the possibilities. Discover demonstrated NGS workflows for molecular biology applications.

Reading for lecture 2

Transcription:

Accelerate High Throughput Analysis for Genome Sequencing with GPU ATIP - A*CRC Workshop on Accelerator Technologies in High Performance Computing May 7-10, 2012 Singapore BingQiang WANG, Head of Scalable Computing, BGI Simon SEE, Chief Scientific Computing Advisor, BGI

DNA double helix encodes secrets of life Facts: Human Genome - Trillions of cells - 23 pairs of chromosomes - 2 meters of DNA - 3 billion DNA subunits (A,T,C,G) - Approximately 30,000 genes code for proteins that perform most life functions

Different Types of Gene Variation

Sequencer, Sequencing, Sequence Sample Preparation Library Sequencing Analysis Results Sequencing technology turns the secret codes (A, T, C, G) into digital form Thus enabling computational research

Next Generation Sequencing (NGS) Indeed 2 nd generation sequencing technology Low cost (several K$ per human genome) High throughput Short reads (small pieces of DNA strand) Lots, lots of data

Big Data Incoming Breadth As sequencing cost falling falling down More individuals are being sequenced Thousands of human individuals: diagnostics and treatment of diseases Tens of thousands of rice individuals: molecular breeding, more food Depth Combining data from other sources / levels DNA, RNA, protein And, dynamics Living cells, living life

Exponential growth Approximately 10 fold every 18 months! Craig Venter, Multiple personal genomes await, Nature, Vol 464, April 2010

Collecting and integrating large-scale, diverse types of data

we are able to isolate and sequence individual cells, monitor the dynamics of single molecules in real time and lower the cost of the technologies that generate all of these data, such that hundreds of millions of individuals can be profiled. Sequencing DNA, RNA, the epigenome, the metabolome and the proteome from numerous cells in millions of individuals, and sequencing environmentally collected samples routinely from thousands of locations a day Eric E. Schadt et al, Computational Solutions to Large-scale Data Management and Analysis, Nature Reviews Genetics, Vol 11, September 2010

Sequencing @BGI World s leading sequencing and genomics research center Started with Human Genome Project in 1999 Several sequencers at that time Now more than 150 sequencers Mass spectrometers to capture protein information Complement sequencing Proteome MODEL ABI 3730XL Roche 454 ABI SOLiD 4 Solexa GA IIx Illumina HiSeq 2000 INSTALLATION 16 1 27 6 135

Computing @BGI Sequencing throughput 6T base pairs per day (upgraded from 4T) ~20 PB data storage Connecting raw data and scientific discovery Analysis tools High performance computing is the key Computing horsepower ~20,000 cores 20+ GPUs ~220 Tflops peak performance Still increasing

Sequencing vs Computing Observation Exponential growth of sequence data output What will happen if, demand for computation grows with amount of data, as o(n) o(n 2 ) beyond o(n 2 )?

Computational Challenges Classical sequence data analysis Alignment as o(n) Variant calling as o(n) Growing computing demand let us mine for gold Population genomics as o(n 2 ) Gene association study as beyond o(n 2 ) Systems biology with various levels of data as beyond o(n 2 ) Sequencing cost down leads to more and more high dimensional analysis Lots, lots of computing

Solution: Disruptive Computing Technology Computing Technology Sequencing Technology Scientific and Clinical Interests

GPU Accelerated Bioinformatics Research Individual tools for routine analysis SOAP3 / SOAP3-DP aligner SNP calling with GSNP Tackle challenging scientific questions Gene variation and association First step: High resolution genotyping with GAMA-MPI

SOAP3 Aligner History and Intro Sequence alignment is a way of arranging the sequences of DNA, RNA, or protein to identify regions of similarity that may be a consequence of functional, structural, or evolutionary relationships between the sequences. (from Wikipedia) SOAP: first-generation short read alignment tool SOAP2 (2008): 20 to 30 times faster than SOAP, less memory Collaboration between BGI & HKU Compressed indexing: bidirectional BWT (2BWT) SOAP3 (2011): 10 to 30 times faster than SOAP2 Collaboration from HKU GPU s parallel processing power CPU memory: increase from a few to tens GB GPU-based indexing: GPU-2BWT

SOAP3 reference reads Alignment BWT Algorithm position1 position2 position reference A C G A T A C A T A C G Sampled suffix array SA Interval Rotation & sorting BWT Index T T G G C A A A C C A A read One site by one site --- from right to left backward search 17

2-BWT : Implemented in SOAP2 and SOAP3 SOAP3 forward search backward search 1- Mismatch alignment 2- Mismatch alignment Case A: The mismatch lies on the first X characters Case B: The mismatch lies on the last m-x characters Case A: Both mismatches lie on P[1..X+Y] Case B: Both mismatches lie on P[X+Y+1..m] Case C: 1st mismatch lies on P[X+1..X+Y], 2nd mismatch lies on P[X+Y+1..m] Case D: 1st mismatch lies on P[1..X], 2nd mismatch lies on P[X+Y+1..m] 18

SOAP3 core ideas SOAP3 Reduce memory access Control branching effect Engineering the index - 2-level sampling becomes 1- level sampling - group data items according to retrieval patterns instead of logical functions - redundancy a search step takes four 32-bit & two 256-bit memory accesses SOAP3 V.S SOAP2 a search step takes two 32-bit & two 128-bit memory accesses Round 1 Round 2 Round 3 only finish those really simples one group those complicated reads together for another round extremely complicated ones are left to the CPU 19

SOAP3 Architecture SOAP3 Host (CPU) Memory-resident data structures Device (GPU) Memory-resdient data structures 2BWT SA 2BWT Execution Execution Process complicated reads & load more reads Process complicated reads & load more reads Process complicated reads & load more reads Process complicated reads & load more reads Process reads with simple alignment structure Process reads with simple alignment structure Process reads with simple alignment structure Process reads with 20

Data type Reads length (bp) Total Number of Reads (million ) Mismatch number SOAP3 (Total Time: second) SOP2 Time for reading reads Time for alignment and output Total time Total time (second) SOAP3 Alignment Speed-up ratio (second) Human 100 16 3 83.30 128.23 211.53 1893.45 14.12 Zebra fish 76 21 3 95.50 724.32 819.81 10671.39 14.6 Alignment Ratio (%) Total Time (second) 100 80 60 84.2 88.29 64.49 76.55 12000 10671.39 40 10000 20 8000 0 Human Zebra fish 6000 4000 2000 0 1893.45 211.53 Human Zebra fish 819.81 15 14.5 14 SOAP2 SOAP3 Speedup Ratio 14.12 14.6 SOAP2 SOAP3 13.5 13 Human Zebra fish

Compared with its predecessor SOAP3 Actual alignment time (excluding index and read loading time) 1600 1400 1200 1000 1447 Alignment Time SOAP3-DP 800 600 594 493 SOAP3-dp SOAP3 400 200 193 0 70.7M 25.3M SOAP3-dp is about 2.5 times more efficient.

Single-end Alignment Compared with Other Tools SOAP3-dp is the fastest and reports the highest percentage of aligned reads SOAP3-DP Total Time of Single-end Alignment (second) Alignment Ratio of Single-end Alignment (%) 10000 8000 6000 4000 2000 0 70.7M 25.3M SOAP3-dp SOAP3 BWA Barracuda Bowtie2 100 98 96 94 92 90 88 86 84 70.7M 25.3M SOAP3-dp SOAP3 BWA Barracuda Bowtie2

Paired-end Alignment Compared with Other Tools SOAP3-dp also show it s superior in paired-end alignment SOAP3-DP Total Time of Paired-end Alignment (second) Alignment Ratio of Paired-end Alignment (%) 25000 100 20000 15000 10000 5000 SOAP3-dp SOAP3 BWA Barracuda Bowtie2 95 90 85 80 75 SOAP3-dp SOAP3 BWA Barracuda Bowtie2 0 70.7M 25.3M 70 70.7M 25.3M

SOAP3-DP Speedup Compared with Other Tools SOAP3-dp is about 2 times faster than SOAP3, while at least 10 times faster when comparing with other tools 20 18 16 14 12 10 8 6 4 2 0 Speedup VS. SOAP3 VS. BWA VS. Bowtie2 Paired-end Single-end

GSNP Data quality Experimental errors Alignment quality score for each base Calculate likelihood for each genotype Prior probability of each genotype Bayesian theory Infer genotype A site A C G A T A C A Differences from reference for each read Alignment A C G T T 12 10 9 6 7 C G A C A 11 10 8 7 5 G T T A C 13 11 9 7 6 read quality score A T A G A 13 11 9 7 6 26

GSNP The parallelization strategy on the GPU: one thread handles one site Optimization techniques of GSNP Thread 0 Thread 1 Thread 2 Thread 3 Thread 4 Thread 5 Thread 6 Thread 7 A C G A T A C A A C G T T 12 10 9 6 7 C G A C A 11 10 8 7 5 G T T A C 13 11 9 7 6 A T A G A 13 11 9 7 6 Reduce memory overhead and branch divergence Balance workloads The Consistency of GPU and CPU Results Reduce I/O cost 27

constant memory on the GPU GSNP The sparse representation of aligned bases Solution to the consistency of GPU and CPU results Site 1 The measured non-zero%: ~0.1% Site 1 Searching for the result SOAPsnp GSNP Site 2 Site 2 64 possible results Computing the result 28

Elapsed time (sec.) Elapsed time (sec.) GSNP End-to-End Performance Comparison of GSNP 100000 Ch.1 21879 10000 Ch. 21 3675 10000 1000 1000 100 10 527 100 10 73 1 GSNP SOAPsnp 1 GSNP SOAPsnp The elapsed time of all components are included. GSNP is around 50X faster than the single-thread CPU-based SOAPsnp. 29

GPU Accelerated Bioinformatics Research Individual tools for routine analysis SOAP3 / SOAP3-DP aligner SNP calling with GSNP Tackle challenging scientific questions Gene variation and association First step: High resolution genotyping with GAMA-MPI

Estimating MAF in a Population with GPU Within a population, SNPs can be assigned a minor allele frequency the lowest allele frequency at a locus that is observed in a particular population. There are variations between human populations, so a SNP allele that is common in one geographical or ethnic group may be much rarer in another. (from Wikipedia) MAF is the foundation of genome wide association study (GWAS), e.g. HapMap project Our approach is a highly accurate yet computationally very expensive one (o(n 2 ))

GAMA Individual SNP Calling alignment reference A C G A G reads G T C G C A G A C C allele A a or A or A?? A a a a A A A A a A a a One individual Probability of AA Aa aa Compute minor allele frequency Add another individual The site has a probability for the occurrence of allele a and A 32

Genotype Likelihood GAMA A A a A a a A A a a 1 0.8 a A a a a A A A 0.6 0.4 0.2 Different sites represent different alleles Individual 1 Individual 2 Individual 3 C B A 0 0 50 100 150 200 250 Allele Frequency Compute allele frequency likelihood for each site Individual 33

Thread 0 Thread 1 Thread GAMA A thread handles a site? SFS construction Bad solution! One Site Computation Multiple threads handle a site : the operator Thread 0 Thread 1 Thread 2 Thread 3 Thread 4 34

GAMA Dataset: Human genome, 512 individuals ( 1024 input files), full scan of 3G sites Version Computing time Total Time Computing Speedup CPU ~ 1518 days ~ 1619 days Total Speedup Note GPU (Single) GPU (86 with MPI *) ~ 15.75 hours ~ 101 days 2313 16 against CPU ~ 717 seconds ~ 5.4 hours 79 449 against single GPU * 86 nodes x 12 cores per node = 1032 cores, with one core processing one file 35

GAMA Dataset: Human genome, 512 individuals ( 1024 input files) Running on TH-1A supercomputer, GPU accelerated version Node Sites Computing Time Parse Time Alltoall Time Output Time Total Time 8 3G 1h55m 20h10m 28m 1h41m 26h30m 8 10M 23.01s 242s 5.61s 20.27s 318s 8 1M 2.40s 24.08s 0.88s 1.19s 40s

Estimating MAF in a Population with Multiple GPUs Joint development with Tianjin Supercomputing Center Based on TH-1A CUDA + MPI Parallel file system (Lustre) AllToAll comm (now)

Summary GPU is very promising to accelerate bioinformatics analysis and life science research Efforts need to be made to leverage data access and computation For large-scale data analysis, especially GWAS type analysis, more study needed

Our Observation Bioinformatics is turning from high throughput computing to data intensive computing RIGHT NOW Tools and systems need to be developed See GAMA-MPI example Tens of minutes computation Several hours for data loading / decompression / parsing / filtering Data intensive architecture Data compression technology Data awareness scheduling

Acknowledgement Team members and Xing Xu, Lin Fang Our research collaborators Prof T.K. Lam, Dr S.M. Yiu from HKU Prof Qiong Luo and group member from HKUST Mian Lu Jiuxin Zhao Yuwei Tan Prof Xiaowen Chu from HKBU National Supercomputing Center at Tianjin

Next Generation Bioinformatics on the Cloud http://www.easygenomics.com Sifei He Director of BGI Cloud hesifei@genomics.cn Xing Xu, Ph.D Senior Product Manager EasyGenomics BGI xuxing@genomics.cn Contact Us info@easygenomics.com

Problems and Solutions Solutions Cloud High Speed Data Exchange Workflows Problems: Big genomic data Geological distribution Algorithm integration +) Resource Management Computational demand 42

EasyGenomics Database, Data management Computational Resources Algorithms, Workflows, Reports High speed connection Web portal, Simple UI EasyGenomics is the bioinformatics platform for research and applications on the cloud

Thank you wangbingqiang@genomics.cn

Backup Slides

node 1 GAMA-MPI Flow Chart Node N start main thread 1 thread M main thread 1 thread M wait decompress and parse data Alltoall data exchange calculate and output result decompress and parse data wait calculate and output result Alltoall data exchange decompress and parse data wait calculate and output result Alltoall data exchange wait finish