Accelerating Genomic Computations 1000X with Hardware

Yatish Turakhia, EE PhD candidate, Stanford University
Prof. Bill Dally (Electrical Engineering and Computer Science)
Prof. Gill Bejerano (Computer Science, Developmental Biology and Pediatrics)

DNA sequencing costs and data explosion
[Figure: DNA sequencing costs, annotated with 1st-, 2nd- and 3rd-generation sequencing]
- Since 2003, genomics data has been doubling every 7 months.
- Exabyte-scale data by 2025, with 100M to 2B genomes to be sequenced (Stephens, Zachary D., et al. "Big data: astronomical or genomical?" PLoS Biology, 2015).
- "Storing and processing genome data will exceed the computing challenges of running YouTube and Twitter, biologists warn." [Nature News, 2015]
- "The decreasing cost of sequencing and the increasing number of sequence reads being generated are placing greater demand on the computational resources and knowledge necessary to handle sequence data." [Genome Biology, 2016]

Genomic Granular Computing Applications
Neonatal ICU
- 4 million newborns per year in the US alone
- 1 in 33 newborns with rare genetic conditions admitted to NICU
- Time is of the essence for genome-based diagnosis
Prenatal ICU and IVF clinics
- Non-invasive diagnosis of over 3,000 rare genetic conditions (e.g. Down Syndrome)
- Free-floating DNA in blood
- Enormous volume
Liquid biopsy
- Early cancer detection: a life-saving application for millions of individuals
- Non-invasive, using circulating tumor DNA
- Periodic sequencing of healthy individuals: enormous volume

Patient Diagnosis: Sample-to-answer
1. Genome sequencing machine: patient sample to reads (ATGTCGAT, CGATACGA, GAGTCATC, ACTGACGT)
2. Read assembly: reads to genome (3 billion base pairs)
3. Find the causal mutation among the mutations:
   REFERENCE: --ATGTCGATGATCCAGAGGATACTAGGATAT-
   PATIENT:   --ATGTCTATGATC--GAGGATATTAGGATAT-
- Long reads (>10 Kbp) offer a better resolution of the mutation spectrum but have a high error rate (15-40%)
- Reference-guided assembly of noisy long reads: >1,300 CPU hours; 14.2M CPU-years for 100M individuals
- De novo assembly of noisy long reads: >15,600 CPU hours; 178M CPU-years for 100M individuals

Darwin: A Genomics Co-processor
[Figure: software passes query (Q) and reference (R) to D-SOFT (filter) through the D-SOFT API and to GACT (aligner) through the GACT API; both run on Darwin]
High speed and programmability:
1. D-SOFT: tunable speed/precision to match any error profile
2. GACT: first algorithm with O(1) memory for the compute-intensive step of alignment, allowing arbitrarily long alignments in hardware; ideal for long reads
3. First framework shown to accelerate reference-guided as well as de novo assembly of reads in hardware

Darwin: 40nm ASIC configuration
[Figure: software drives Darwin through the D-SOFT API and GACT API; Darwin contains D-SOFT and GACT blocks and is attached to two LPDDR4 (32GB) memories]
Software (Intel Xeon E5):
Algorithm   Power (1 thread)
BWA-MEM     9.2W
GraphMap    10.7W
DALIGNER    8.8W
Darwin ASIC: area 300 mm^2, power 9W

GACT algorithm and hardware design

Strategies for long sequence alignment
Algorithm               Time    Space (compute-intensive step)   Optimal
Smith-Waterman          O(mn)   O(mn)                            Y
Hirschberg              O(mn)   O(m+n)                           Y
Banded Smith-Waterman   O(n)    O(n)                             N
X-drop                  O(n)    O(n)                             N
GACT                    O(n)    O(1)                             N
m, n: sequence lengths, m >= n
Profound hardware design implications. Prior assumptions (hardware):
- Small upper bound on sequence length n, OR
- Trace-back of alignment in software (slow!)

Genome Alignment using Constant-memory Trace-back (GACT)
1. Initialize I_curr, J_curr in R, Q
2. Form a tile of maximum size T around I_curr, J_curr in R, Q
3. Align the tile with trace-back from I_curr, J_curr with at most (T-D) steps
4. Update I_curr, J_curr with the trace-back end coordinates
5. Repeat 2-4 until extension is no longer possible
Example (T = 5, D = 2; Tiles 1 to 3):
Reference (R): G G C G A C T T T
Query (Q):     G G T C G T T T
Optimal alignment (Score = 11):
G G - C G A C T T T
G G T C G - - T T T
GACT alignment (Score = 11):
G G - C G A C T T T
G G T C G - - T T T
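The five steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the Darwin implementation: the scoring (match +2, mismatch -1, gap -1) is not given on the slide and was chosen because it yields the slide's score of 11; D is interpreted as the overlap between consecutive tiles; extension runs leftward from the sequence ends; and the tie-breaking order, along with the names `align_tile` and `gact_left_extend`, is hypothetical.

```python
MATCH, MISMATCH, GAP = 2, -1, -1   # assumed scoring, not stated on the slide
STOP, DIAG, UP, LEFT = range(4)    # trace-back pointers


def align_tile(r_tile, q_tile, first_tile, max_steps):
    """Fill one tile, then trace back at most max_steps bases of R or Q."""
    rows, cols = len(r_tile), len(q_tile)
    score = [[0] * (cols + 1) for _ in range(rows + 1)]
    ptr = [[STOP] * (cols + 1) for _ in range(rows + 1)]
    best, best_cell = 0, (0, 0)
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            sub = MATCH if r_tile[i - 1] == q_tile[j - 1] else MISMATCH
            candidates = [
                (0, STOP),
                (score[i - 1][j - 1] + sub, DIAG),
                (score[i - 1][j] + GAP, UP),    # consume R only (gap in Q)
                (score[i][j - 1] + GAP, LEFT),  # consume Q only (gap in R)
            ]
            # max() returns the first maximal entry, so ties resolve in list order.
            score[i][j], ptr[i][j] = max(candidates, key=lambda c: c[0])
            if score[i][j] > best:
                best, best_cell = score[i][j], (i, j)
    # The first tile starts at its best cell; later tiles at the bottom-right corner.
    start = best_cell if first_tile else (rows, cols)
    i, j = start
    ops = []
    while ptr[i][j] != STOP and start[0] - i < max_steps and start[1] - j < max_steps:
        op = ptr[i][j]
        ops.append(op)
        if op != LEFT:
            i -= 1
        if op != UP:
            j -= 1
    return start, ops


def gact_left_extend(ref, query, i_curr, j_curr, tile=5, overlap=2):
    """Tile-by-tile left extension; only one tile is held in memory at a time."""
    aligned_r, aligned_q = [], []
    first_tile = True
    while i_curr > 0 and j_curr > 0:
        i_start, j_start = max(0, i_curr - tile), max(0, j_curr - tile)
        start, ops = align_tile(ref[i_start:i_curr], query[j_start:j_curr],
                                first_tile, tile - overlap)
        if first_tile:
            i_curr, j_curr = i_start + start[0], j_start + start[1]
            first_tile = False
        if not ops:  # extension no longer possible
            break
        for op in ops:
            aligned_r.append(ref[i_curr - 1] if op != LEFT else "-")
            aligned_q.append(query[j_curr - 1] if op != UP else "-")
            if op != LEFT:
                i_curr -= 1
            if op != UP:
                j_curr -= 1
    a_r = "".join(reversed(aligned_r))
    a_q = "".join(reversed(aligned_q))
    total = sum(GAP if "-" in (a, b) else MATCH if a == b else MISMATCH
                for a, b in zip(a_r, a_q))
    return a_r, a_q, total


print(*gact_left_extend("GGCGACTTT", "GGTCGTTT", 9, 8, tile=5, overlap=2), sep="\n")
# GG-CGACTTT
# GGTCG--TTT
# 11
```

Because each call to `align_tile` allocates at most (T+1) x (T+1) cells regardless of sequence length, the memory for the compute-intensive step stays constant, which is the property the slide highlights. Under the assumed tie-breaking, this sketch walks the example in four tile iterations rather than the three tiles drawn on the slide.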

GACT empirically provides optimal alignments
- GACT tile size T = 400
- GACT compared to optimal Smith-Waterman for 200,000 10 Kbp sequences with 4 different error rates: 10%, 20%, 30% and 40%
- Simple scoring (match: +1, mismatch: -1, gap: -1)
- At D = 120, all observed alignments were optimal

D (bp)   Fraction of alignments non-optimal      Worst-case score loss
         10%     20%     30%     40%             10%     20%     30%     40%
0        30.4%   61.0%   83.0%   94.7%           0.29%   0.67%   1.26%   2.38%
30       0.0%    0.02%   0.55%   55.3%           0.0%    0.35%   0.63%   1.59%
60       0.0%    0.0%    0.01%   1.38%           0.0%    0.0%    0.34%   0.81%
90       0.0%    0.0%    0.0%    0.05%           0.0%    0.0%    0.0%    0.33%
120      0.0%    0.0%    0.0%    0.0%            0.0%    0.0%    0.0%    0.0%

GACT Hardware-acceleration
[Figure: systolic array of four PEs (PE 0 to PE 3), each with its own SRAM, plus trace-back (TB) logic; a tile with T = 9 whose query is split into Query Blocks 1 to 3 while the reference is streamed through]
- A systolic array of N_pe (= 4) processing elements (PEs) solves Smith-Waterman-Gotoh
- For a tile with size T > N_pe, the query is divided into blocks and the reference is streamed through each block
- Computation exploits wave-front parallelism
- On-chip SRAM stores trace-back state (4 bits per cell)
- Total SRAM size = 4 bits x (T_max)^2 => 128KB for T_max = 512
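The SRAM sizing and query blocking above can be checked with a few lines of arithmetic. This is only a sketch of the slide's numbers; the function names are hypothetical, and ceiling division for the block count is an assumption that matches the three query blocks shown for T = 9 and N_pe = 4.

```python
def traceback_sram_bytes(t_max: int, bits_per_cell: int = 4) -> int:
    """Bytes needed to hold trace-back state for a T_max x T_max tile."""
    total_bits = bits_per_cell * t_max * t_max
    return total_bits // 8


def query_blocks(tile_size: int, n_pe: int) -> int:
    """Query blocks needed when each block covers n_pe query bases (ceiling division)."""
    return -(-tile_size // n_pe)


print(traceback_sram_bytes(512) // 1024, "KB")  # 128 KB
print(query_blocks(9, 4))                       # 3
```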

Darwin: GACT Performance
[Figure: alignments/sec (log scale, 1 to 1,000,000) versus sequence length (1 to 10 Kbp) for GACT (Software), Edlib and GACT (Darwin); data labels in the plot include 574K, 108K and 54K alignments/sec and speedups of 302X, 591X, 986X, 35X, 19X and 11X]
- Runtime scales linearly with sequence length
- 300-1000X faster than Edlib
- 10,000X faster than the software implementation of GACT

D-SOFT algorithm and hardware design

Seed Position table based exact matching
R = AGCTATACTA (seeds of size 2)
Seed   Positions
AC     6
AG     0
AT     4
CT     2, 7
GC     1
TA     3, 5, 8
(AA, CA, CC, CG, GA, GG, GT, TC, TG, TT: no positions)
Q = GCTA; seed hits: GC: 1; CT: 2, 7; TA: 3, 5, 8
[Figure: seed hits plotted with Q offset against R position; matching hits fall on a line of slope 1]
For the human genome, seed position table size > 12GB (4B x 3 x 10^9)
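The table on this slide can be reproduced with a short Python sketch. It uses a plain dictionary rather than whatever memory layout Darwin uses, and the function names `build_seed_table` and `seed_hits` are hypothetical.

```python
from collections import defaultdict


def build_seed_table(ref: str, k: int) -> dict:
    """Map every k-mer in ref to the sorted list of positions where it starts."""
    table = defaultdict(list)
    for pos in range(len(ref) - k + 1):
        table[ref[pos:pos + k]].append(pos)
    return table


def seed_hits(table: dict, query: str, k: int) -> list:
    """For each query offset, list (offset, seed, positions of that seed in ref)."""
    hits = []
    for off in range(len(query) - k + 1):
        seed = query[off:off + k]
        hits.append((off, seed, table.get(seed, [])))
    return hits


table = build_seed_table("AGCTATACTA", k=2)
for off, seed, positions in seed_hits(table, "GCTA", k=2):
    print(off, seed, positions)
# 0 GC [1]
# 1 CT [2, 7]
# 2 TA [3, 5, 8]

# Size estimate from the slide: 4 bytes per position x 3 x 10^9 positions.
print(4 * 3 * 10**9 / 1e9, "GB")  # 12.0 GB
```

Each hit's reference position minus its query offset is 1 for the first hit of every seed (1-0, 2-1, 3-2), which is the slope-1 diagonal the slide points out.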

Diagonal-band Seed Overlapping based Filtration Technique (D-SOFT)
[Figure: query (Q) against reference (R), with R divided into Bin 1 to Bin 6; example parameters N_B = 6, N = 10, k = 4, h = 7; values 6, 5, 9, 4, 0, 5 shown for the bins]
- Divide R into N_B bins (diagonal bands)
- Use N seeds of size k bp from different offsets in Q
- Look up the positions of the seeds in R and assign each seed hit to its corresponding bin (diagonal band)
- Count the non-overlapping Q base pairs covered by seed hits for each bin and filter based on threshold h (same as DALIGNER)
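The filtering steps above can be sketched as follows. This is a hedged illustration rather than Darwin's implementation: indexing bins by `(ref_pos - query_offset) // bin_size`, tracking the last seed offset per bin to avoid double-counting overlapping bases, and the names `dsoft` and `bin_size` are all assumptions made for the example.

```python
from collections import defaultdict


def dsoft(ref, query, k, seed_offsets, bin_size, h):
    """Return (per-bin covered-base counts, bins whose count reaches threshold h)."""
    # Seed position table: k-mer -> start positions in ref.
    table = defaultdict(list)
    for pos in range(len(ref) - k + 1):
        table[ref[pos:pos + k]].append(pos)

    bp_count = defaultdict(int)         # non-overlapping query bases covered, per bin
    last_hit = defaultdict(lambda: -k)  # query offset of the last seed hit, per bin
    candidates = []
    for j in sorted(seed_offsets):
        seed = query[j:j + k]
        for i in table.get(seed, []):
            b = (i - j) // bin_size     # diagonal band that this hit falls in
            overlap = max(0, last_hit[b] + k - j)
            last_hit[b] = j
            before = bp_count[b]
            bp_count[b] += k - overlap  # count only newly covered query bases
            if before < h <= bp_count[b]:
                candidates.append(b)    # report each bin once, when it crosses h
    return dict(bp_count), candidates


counts, bins = dsoft("AGCTATACTA", "GCTA", k=2, seed_offsets=[0, 1, 2],
                     bin_size=2, h=4)
print(counts)  # {0: 4, 3: 3, 1: 2}
print(bins)    # [0]
```

In this toy run, bin 0 collects all four query bases because GCTA occurs at position 1 of the reference, so only that band passes the threshold and would be handed to the aligner.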

D-SOFT hardware-acceleration design
- Area: 264 mm^2; power: 7.3W
- Random accesses to update bins use on-chip SRAM (bin count SRAM)
- Area and power are both dominated by the 64MB bin count SRAM
- Hardware exploits DRAM channel parallelism for seed position lookup

D-SOFT hardware-acceleration throughput
k    Avg. hits per seed   Throughput (10^3 seeds/sec)   Darwin
     (human genome)       Software      Darwin          speedup
11   1765                 7.9           760.6           96.3X
12   457                  29.1          2,796.2         96.1X
13   118                  136.1         9,126.3         67.1X
14   32                   339.0         21,271.1        62.7X
15   8                    784.3         34,166.7        43.5X
- ~2X speedup from parallel DRAM channels
- ~3X reduction in the number of memory accesses to the DRAM
- All random memory accesses to update bins use on-chip SRAM (64MB)
- On-chip updates completely hide off-chip (DRAM) bandwidth

Long read assembly on Darwin

Darwin: Read assembly
[Figure panels: Reference-guided; De novo]

Darwin: Performance Results

Reference-guided (54X human genome)
Read error rate   D-SOFT settings (k, N, h)   Sensitivity (baseline)   Sensitivity (Darwin)   Speedup
15%               (14, 750, 24)               95.95%                   99.91%                 4,110X
30%               (12, 1000, 25)              98.11%                   98.40%                 4,088X
40%               (11, 1300, 22)              97.10%                   97.40%                 128X
Baseline: BWA-MEM (15%), GraphMap (30%, 40%)

De novo (54X human genome)
Read error rate   D-SOFT settings (k, N, h)   Sensitivity (baseline)   Sensitivity (Darwin)   Speedup (Bottleneck)
15%               (14, 1300, 24)              99.80%                   99.89%                 264X
Baseline: DALIGNER

Thank you! Questions or feedback?