Accelerating Genomic Computations 1000X with Hardware


1 Accelerating Genomic Computations 1000X with Hardware
Yatish Turakhia, EE PhD candidate, Stanford University
Prof. Bill Dally (Electrical Engineering and Computer Science)
Prof. Gill Bejerano (Computer Science, Developmental Biology and Pediatrics)

2 DNA sequencing costs and data explosion
[Figure labels: 1st gen, 2nd gen, 3rd gen]
- Since 2003, genomics data has been doubling every 7 months!
- Exabyte data by [year missing in transcript]; [count missing]M to 2B genomes to be sequenced!
  Stephens, Zachary D., et al. "Big data: astronomical or genomical?" PLoS Biology (2015)
- "Storing and processing genome data will exceed the computing challenges of running YouTube and Twitter, biologists warn." [Nature News, 2015]
- "The decreasing cost of sequencing and the increasing number of sequence reads being generated are placing greater demand on the computational resources and knowledge necessary to handle sequence data." [Genome Biology, 2016]

3 Genomic Granular Computing Applications
Neonatal ICU
- 4 million newborns per year in the US alone
- 1 in 33 newborns with rare genetic conditions admitted to the NICU
- Time is of the essence for genome-based diagnosis
Prenatal ICU and IVF clinics
- Non-invasively diagnose over 3,000 rare genetic conditions (e.g. Down Syndrome)
- Free-floating DNA in blood
- Enormous volume!
Liquid Biopsy
- Early cancer detection: a life-saving application for millions of individuals
- Non-invasive: circulating tumor DNA
- Periodic sequencing of healthy individuals: enormous volume!

4 Patient Diagnosis: Sample-to-answer
1. Genome sequencing machine: patient sample to reads (ATGTCGAT, CGATACGA, GAGTCATC, ACTGACGT)
2. Read assembly: reads to genome (3 billion base pairs)
3. Find the causal mutation among the mutations:
   REFERENCE: --ATGTCGATGATCCAGAGGATACTAGGATAT-
   PATIENT:   --ATGTCTATGATC--GAGGATATTAGGATAT-
- Long reads (>10Kbp) offer a better resolution of the mutation spectrum but have a high error rate (15-40%)
- >1,300 CPU hours for reference-guided assembly of noisy long reads: 14.2M CPU-years for 100M individuals
- >15,600 CPU hours for de novo assembly of noisy long reads: 178M CPU-years for 100M individuals

5 Darwin: A Genomics Co-processor
[Figure: query (Q) and reference (R) pass through D-SOFT (filter) and GACT (aligner) inside Darwin; software drives both through the D-SOFT API and the GACT API]
High speed and programmability
1. D-SOFT: tunable speed/precision to match any error profile
2. GACT: first algorithm with O(1) memory for the compute-intensive step of alignment, allowing arbitrarily long alignments in hardware; ideal for long reads
3. First framework shown to accelerate reference-guided as well as de novo assembly of reads in hardware

6 Darwin: 40nm ASIC configuration
[Figure: Darwin block diagram with LPDDR4 (32GB) memories, a D-SOFT unit and several GACT units, accessed from software through the D-SOFT API and GACT API]

Software (Intel Xeon E5)
Algorithm    Power (1 thread)
BWA-MEM      9.2W
GraphMap     10.7W
DALIGNER     8.8W

Darwin
Area: 300mm²
Power: 9W

7 GACT algorithm and hardware design

8 Strategies for long sequence alignment

Algorithm               Time     Space (compute-intensive step)   Optimal
Smith-Waterman          O(mn)    O(mn)                            Y
Hirschberg              O(mn)    O(m+n)                           Y
Banded Smith-Waterman   O(n)     O(n)                             N
X-drop                  O(n)     O(n)                             N
GACT                    O(n)     O(1)                             N

m, n: sequence lengths, m >= n

Profound hardware design implications
Prior assumptions (hardware):
- Small upper bound on sequence length n, OR
- Trace-back of alignment in software (SLOW!)
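The space column above concerns the trace-back, not the score itself. As a hedged illustration (a generic textbook sketch, not code from the talk), the best Smith-Waterman local score can be computed with two rolling rows; what forces O(mn) storage in the classic algorithm is keeping a pointer per cell so the alignment can be traced back afterwards. The function name and the scoring defaults (+1/-1/-1, borrowed from the simple scoring used later in the talk) are assumptions for illustration.

```python
def smith_waterman_score(a, b, match=1, mismatch=-1, gap=-1):
    """Best local alignment score of a and b using two rolling rows.

    Only the score is returned. Recovering the alignment itself would need
    a trace-back pointer for every (i, j) cell, i.e. O(len(a) * len(b)) state.
    """
    prev = [0] * (len(b) + 1)  # scores of the previous row
    best = 0
    for ca in a:
        curr = [0]  # column 0 of the current row
        for j, cb in enumerate(b, start=1):
            sub = match if ca == cb else mismatch
            cell = max(0, prev[j - 1] + sub, prev[j] + gap, curr[j - 1] + gap)
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best
```

Bounding that pointer state to a fixed-size tile, rather than to the full sequence length, is the property the GACT row of the table claims.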

9 Genome Alignment using Constant-memory Trace-back (GACT)
1. Initialize I_curr, J_curr in R, Q
2. Form a tile of maximum size T around I_curr, J_curr in R, Q
3. Align the tile with trace-back from I_curr, J_curr with at most (T-D) steps
4. Update I_curr, J_curr with the trace-back end coordinates
5. Repeat 2-4 until extension is no longer possible

Example (T = 5, D = 2; Tiles 1 to 3)
Query (Q):     G G T C G T T T
Reference (R): G G C G A C T T T

Optimal alignment (score = 11):
  G G - C G A C T T T
  G G T C G - - T T T
GACT alignment (score = 11):
  G G - C G A C T T T
  G G T C G - - T T T
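The five steps above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the Darwin implementation: trace-back starts at the ends of both sequences, each tile is scored with linear gaps and no local-alignment zero floor, and extension stops once either sequence is exhausted. The names `align_tile` and `gact`, and the op letters M (diagonal), D (reference base only) and I (query base only), are hypothetical.

```python
def align_tile(r, q, max_steps, match=1, mismatch=-1, gap=-1):
    """Score one tile and trace back from its bottom-right corner.

    Returns (ops, r_used, q_used); ops are listed from the tile end backwards.
    Memory is O(len(r) * len(q)), i.e. bounded by the tile size, not the read.
    """
    n, m = len(r), len(q)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    move = [[""] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if r[i - 1] == q[j - 1] else mismatch
            options = [
                (score[i - 1][j - 1] + sub, "M"),
                (score[i - 1][j] + gap, "D"),
                (score[i][j - 1] + gap, "I"),
            ]
            # max() keeps the first best option, so ties prefer the diagonal.
            score[i][j], move[i][j] = max(options, key=lambda o: o[0])
    i, j, ops = n, m, []
    while i > 0 and j > 0 and len(ops) < max_steps:
        op = move[i][j]
        ops.append(op)
        if op != "I":
            i -= 1
        if op != "D":
            j -= 1
    return ops, n - i, m - j


def gact(R, Q, T=5, D=2):
    """Tile-by-tile left extension; returns (i_end, j_end, op string)."""
    i_curr, j_curr = len(R), len(Q)               # step 1 (assumed start point)
    ops = []
    while i_curr > 0 and j_curr > 0:              # step 5
        i0, j0 = max(0, i_curr - T), max(0, j_curr - T)                      # step 2
        tile_ops, di, dj = align_tile(R[i0:i_curr], Q[j0:j_curr], T - D)     # step 3
        ops.extend(tile_ops)
        i_curr, j_curr = i_curr - di, j_curr - dj                            # step 4
    return i_curr, j_curr, "".join(reversed(ops))
```

Because each trace-back is capped at T-D steps, consecutive tiles overlap by at least D cells, which is what gives the later tile context to correct the path near the previous tile's boundary. For example, `gact("GGCGACTTT", "GGTCGTTT")` runs the slide's sequences through this simplified variant; its output is not claimed to match the slide's alignment.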

10 GACT empirically provides optimal alignments
- GACT tile size T = 400
- GACT compared to optimal Smith-Waterman for 200,000 10Kbp sequences with 4 different error rates: 10%, 20%, 30% and 40%
- Simple scoring (match: +1, mismatch: -1, gap: -1)
- At D = 120, all observed alignments were optimal

Each row is one value of D (in bp). The D value and the 10% entry of the first column group are missing from every row of the transcript and are shown as [...].

         Fraction of alignments non-optimal     Worst-case score loss
D        10%     20%     30%     40%            10%     20%     30%     40%
[...]    [...]   61.0%   83.0%   94.7%          0.29%   0.67%   1.26%   2.38%
[...]    [...]   0.02%   0.55%   55.3%          0.0%    0.35%   0.63%   1.59%
[...]    [...]   0.0%    0.01%   1.38%          0.0%    0.0%    0.34%   0.81%
[...]    [...]   0.0%    0.0%    0.05%          0.0%    0.0%    0.0%    0.33%
[...]    [...]   0.0%    0.0%    0.0%           0.0%    0.0%    0.0%    0.0%

11 GACT hardware acceleration
[Figure: systolic array of four PEs (PE 0 to PE 3), each with its own SRAM, plus trace-back (TB) logic; a tile with T = 9 whose query is split into Query Blocks 1 to 3 while the reference is streamed through]
- Systolic array of N_pe (= 4) processing elements (PEs) solves Smith-Waterman-Gotoh
- For a tile with size T > N_pe, the query is divided into blocks and the reference is streamed through each block
- Computation exploits wave-front parallelism
- On-chip SRAM stores trace-back state (4 bits per cell)
- Total SRAM size = 4 bits x (T_max)^2 => 128KB for T_max = [value truncated in transcript]
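The SRAM sizing formula on this slide is simple enough to evaluate directly. The sketch below only applies the stated formula (4 bits per cell times T_max squared); the function name is hypothetical, and since the slide's own T_max value is cut off in the transcript, the tile sizes tried here are illustrative inputs rather than the talk's configuration.

```python
def traceback_sram_bytes(t_max, bits_per_cell=4):
    """Bytes needed to hold one trace-back pointer per cell of a t_max x t_max tile."""
    return bits_per_cell * t_max * t_max // 8


# Illustrative tile sizes only; the slide's T_max is not recoverable here.
for t in (256, 400, 512):
    print(t, traceback_sram_bytes(t) / 1024, "KB")
```

Under this formula, and taking 1KB as 1024 bytes, a 512 x 512 tile is the size that comes to exactly 128KB.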

12 Darwin: GACT Performance
[Figure: alignments/sec against sequence length (Kbp) for GACT (software), Edlib and GACT (Darwin); surviving data labels: 108K, 54K, 19X, 986X, 11X]
- Runtime scales linearly with sequence length
- [factor missing in transcript]X faster than Edlib
- 10,000X faster than the software implementation of GACT

13 D-SOFT algorithm and hardware design

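14 Seed position table based exact matching
R = AGCTATACTA, Q = GCTA

Seed   Positions in R
AC     6
AG     0
AT     4
CT     2, 7
GC     1
TA     3, 5, 8
(AA, CA, CC, CG, GA, GG, GT, TC, TG, TT: no positions)

Seed hits for Q: GC: 1; CT: 2, 7; TA: 3, 5, 8
[Figure: seed hits plotted on Q and R axes; the "Slope=" label is truncated in the transcript]
For the human genome, the seed position table size is > 12GB (4B x 3 x 10^9)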
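The table on this slide can be rebuilt with a few lines of Python. This is a minimal sketch using a plain dictionary; the function names are hypothetical, and the hardware layout of the table (how it is stored in DRAM) is not modelled.

```python
from collections import defaultdict


def build_seed_table(ref, k):
    """Map every k-bp seed in ref to the sorted list of its start positions."""
    table = defaultdict(list)
    for pos in range(len(ref) - k + 1):
        table[ref[pos:pos + k]].append(pos)
    return dict(table)


def seed_hits(table, query, k):
    """For each query offset, list the reference positions of that seed."""
    return [
        (off, query[off:off + k], table.get(query[off:off + k], []))
        for off in range(len(query) - k + 1)
    ]


table = build_seed_table("AGCTATACTA", 2)
for off, seed, positions in seed_hits(table, "GCTA", 2):
    print(off, seed, positions)
# 0 GC [1]
# 1 CT [2, 7]
# 2 TA [3, 5, 8]
```

The hits (1, 2, 3) for offsets (0, 1, 2) advance in step with the query offset, which is how an exact match of Q at position 1 of R shows up in the table lookups.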

15 Diagonal-band Seed Overlapping based Filtration Technique (D-SOFT)
[Figure: query (Q) against reference (R) with diagonal bands Bin 1 to Bin 6; N_B = 6, N = 10, k = 4, h = 7]
- Divide R into N_B bins (diagonal bands)
- Use N seeds of size k bp from different offsets in Q
- Look up the positions of the seeds in R and assign each seed hit to its corresponding bin (diagonal band)
- Count the non-overlapping Q base pairs covered by seed hits in each bin and filter based on threshold h (same as DALIGNER)
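The four bullets above can be sketched as a small software filter. This is an illustration under assumptions, not the Darwin design: seeds are taken from consecutive query offsets 0..N-1, a bin is identified by the floor of (reference position - query offset) / `bin_size`, and overlap is tracked only against the most recent seed hit in each bin. The function name and the `bin_size` parameter are hypothetical (the slide parameterises by the number of bins N_B instead).

```python
from collections import defaultdict


def dsoft_candidate_bins(ref, query, k, n_seeds, h, bin_size):
    """Return the diagonal-band bins whose covered query bases reach threshold h."""
    # Seed position table: k-mer -> start positions in ref.
    table = defaultdict(list)
    for pos in range(len(ref) - k + 1):
        table[ref[pos:pos + k]].append(pos)

    covered = defaultdict(int)  # bin -> non-overlapping query bases covered
    last_offset = {}            # bin -> query offset of the latest hit in that bin
    passed = set()
    # Assumption: seeds come from consecutive offsets, processed in increasing order.
    for j in range(min(n_seeds, len(query) - k + 1)):
        for i in table.get(query[j:j + k], ()):
            b = (i - j) // bin_size  # hits on nearby diagonals share a bin
            prev = last_offset.get(b)
            # Bases of this seed already counted via the previous hit in the bin.
            overlap = 0 if prev is None else max(0, prev + k - j)
            covered[b] += k - overlap
            last_offset[b] = j
            if covered[b] >= h:
                passed.add(b)
    return sorted(passed)


# With the toy sequences from the seed-table slide, only the band containing
# the true match (diagonal 1, bin 0) covers all 4 query bases.
print(dsoft_candidate_bins("AGCTATACTA", "GCTA", k=2, n_seeds=3, h=4, bin_size=2))
```

Counting covered bases rather than raw seed hits means that several overlapping seeds on the same diagonal do not inflate a bin's count, so the threshold h reflects how much of the query a band actually explains.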

16 D-SOFT hardware-acceleration design
Area: 264mm²
Power: 7.3W
- Random accesses to update bins use on-chip SRAM (bin count SRAM)
- Area and power are both dominated by the 64MB bin count SRAM
- Hardware exploits DRAM channel parallelism for seed position lookup

17 D-SOFT hardware-acceleration throughput
[Table: k; avg. hits per seed (human genome); throughput (10^3 seeds/sec) for software and for Darwin; Darwin speedup. The cell values did not survive transcription.]
- ~2X speedup from parallel DRAM channels
- ~3X reduction in the number of memory accesses to the DRAM
- All random memory accesses to update bins use on-chip SRAM (64MB)
- On-chip updates completely hide off-chip (DRAM) bandwidth

18 Long read assembly on Darwin

19 Darwin: Read assembly
[Figure panels: Reference-guided, De novo]

20 Darwin: Performance Results

Reference-guided (54X human genome)
Read error rate   D-SOFT settings (k, N, h)   Baseline sensitivity   Darwin sensitivity   Speedup
15%               (14, 750, 24)               95.95%                 99.91%               4,110X
30%               (12, 1000, 25)              98.11%                 98.40%               4,088X
40%               (11, 1300, 22)              97.10%                 97.40%               128X
Baseline: BWA-MEM (15%), GraphMap (30%, 40%)

De novo (54X human genome)
Read error rate   D-SOFT settings (k, N, h)   Baseline sensitivity   Darwin sensitivity   Speedup (Bottleneck)
15%               (14, 1300, 24)              99.80%                 99.89%               264X
Baseline: DALIGNER

21 Thank you! Questions or feedback?


More information

E2ES to Accelerate Next-Generation Genome Analysis in Clinical Research

E2ES to Accelerate Next-Generation Genome Analysis in Clinical Research www.hcltech.com E2ES to Accelerate Next-Generation Genome Analysis in Clinical Research whitepaper April 2015 TABLE OF CONTENTS Introduction 3 Challenges associated with NGS data analysis 3 HCL s NGS Solution

More information

Next Generation Sequencing. Target Enrichment

Next Generation Sequencing. Target Enrichment Next Generation Sequencing Target Enrichment Next Generation Sequencing Your Partner in Every Step from Sample to Data NGS: Revolutionizing Genetic Analysis with Single-Molecule Resolution Next generation

More information

ERLANGEN REGIONAL COMPUTING CENTER

ERLANGEN REGIONAL COMPUTING CENTER ERLANGEN REGIONAL COMPUTING CENTER Components for practical performance engineering in a computing center environment: The ProPE project Jan Eitzinger Workshop on Performance Engineering for HPC: Implementation,

More information

Haplotype phasing in large cohorts: Modeling, search, or both?

Haplotype phasing in large cohorts: Modeling, search, or both? Haplotype phasing in large cohorts: Modeling, search, or both? Po-Ru Loh Harvard T.H. Chan School of Public Health Department of Epidemiology Broad MIA Seminar, 3/9/16 Overview Background: Haplotype phasing

More information

Algorithms for Bioinformatics

Algorithms for Bioinformatics Algorithms for Bioinformatics Compressive Genomics Ulf Leser Content of this Lecture Next Generation Sequencing Sequence compression Approximate search in compressed genomes Using multiple references This

More information

University of California at Berkeley College of Engineering Computer Science Division - EECS. Computer Architecture and Engineering Midterm II

University of California at Berkeley College of Engineering Computer Science Division - EECS. Computer Architecture and Engineering Midterm II University of California at Berkeley College of Engineering Computer Science Division - EECS CS 152 Fall 1995 D. Patterson & R. Yung Computer Architecture and Engineering Midterm II Your Name: SID Number:

More information

CS294: RISE Logistics, Overview, Trends

CS294: RISE Logistics, Overview, Trends CS294: RISE Logistics, Overview, Trends Joey Gonzalez, Joe Hellerstein, Raluca Popa, Ion Stoica August 29, 2016 2 Goal of this Class Bootstrap RISE research agenda Start new projects or work on existing

More information

SHENGYUAN LIU, JUNGANG XU, ZONGZHENG LIU, XU LIU & RICE UNIVERSITY

SHENGYUAN LIU, JUNGANG XU, ZONGZHENG LIU, XU LIU & RICE UNIVERSITY EVALUATING TASK SCHEDULING IN HADOOP-BASED CLOUD SYSTEMS SHENGYUAN LIU, JUNGANG XU, ZONGZHENG LIU, XU LIU UNIVERSITY OF CHINESE ACADEMY OF SCIENCES & RICE UNIVERSITY 2013-9-30 OUTLINE Background & Motivation

More information

USING HPC CLASS INFRASTRUCTURE FOR HIGH THROUGHPUT COMPUTING IN GENOMICS

USING HPC CLASS INFRASTRUCTURE FOR HIGH THROUGHPUT COMPUTING IN GENOMICS USING HPC CLASS INFRASTRUCTURE FOR HIGH THROUGHPUT COMPUTING IN GENOMICS Claude SCARPELLI Claude.Scarpelli@cea.fr FUNDAMENTAL RESEARCH DIVISION GENOMIC INSTITUTE Intel DDN Life Science Field Day Heidelberg,

More information

ECLIPSE 2012 Performance Benchmark and Profiling. August 2012

ECLIPSE 2012 Performance Benchmark and Profiling. August 2012 ECLIPSE 2012 Performance Benchmark and Profiling August 2012 Note The following research was performed under the HPC Advisory Council activities Participating vendors: Intel, Dell, Mellanox Compute resource

More information

Big Data and Real Time Analytics Streams and Hadoop

Big Data and Real Time Analytics Streams and Hadoop Big Data and Real Time Analytics Streams and Hadoop Infrastructure Matters 2014 Briefing 2014 IBM Corporation Big Data is more than just Hadoop What can you tell me about Big Data? I want to know all about

More information

2015 IBM Corporation

2015 IBM Corporation 2015 IBM Corporation Marco Garibaldi IBM Pre-Sales Technical Support Prestazioni estreme, accelerazione applicativa,velocità ed efficienza per generare valore dai dati 2015 IBM Corporation Trend nelle

More information

Applications of Big Data in Evidence-Based Medicine

Applications of Big Data in Evidence-Based Medicine Applications of Big Data in Evidence-Based Medicine Carolyn Compton, MD, PhD Professor Life Sciences, Arizona State University Professor Laboratory Medicine and Pathology, Mayo Clinic Adjunct Professor

More information

OptimoDE: Programmable Accelerator Engines Through Retargetable Customization

OptimoDE: Programmable Accelerator Engines Through Retargetable Customization OptimoDE: Programmable Accelerator Engines Through Retargetable Customization Nathan Clark, Hongtao Zhong, Kevin Fan, Scott Mahlke CCCP Research Group University of Michigan http://cccp.eecs.umich.edu

More information

1 st JILP Workshop on Computer Architecture Competitions (JWAC-1):

1 st JILP Workshop on Computer Architecture Competitions (JWAC-1): The Journal of Instruction-Level Parallelism 1 st JILP Workshop on Computer Architecture Competitions (JWAC-1): Cache Replacement Championship Held In Conjunction with ISCA 2010 Saint Malo, France Forward

More information

Bias Scheduling in Heterogeneous Multicore Architectures. David Koufaty Dheeraj Reddy Scott Hahn

Bias Scheduling in Heterogeneous Multicore Architectures. David Koufaty Dheeraj Reddy Scott Hahn Bias Scheduling in Heterogeneous Multicore Architectures David Koufaty Dheeraj Reddy Scott Hahn Motivation Mainstream multicore processors consist of identical cores Complexity dictated by product goals,

More information

Why can GBS be complicated? Tools for filtering, error correction and imputation.

Why can GBS be complicated? Tools for filtering, error correction and imputation. Why can GBS be complicated? Tools for filtering, error correction and imputation. Edward Buckler USDA-ARS Cornell University http://www.maizegenetics.net Many Organisms Are Diverse Humans are at the lower

More information

RODOD Performance Test on Exalogic and Exadata Engineered Systems

RODOD Performance Test on Exalogic and Exadata Engineered Systems An Oracle White Paper March 2014 RODOD Performance Test on Exalogic and Exadata Engineered Systems Introduction Oracle Communications Rapid Offer Design and Order Delivery (RODOD) is an innovative, fully

More information

IBM xseries 430. Versatile, scalable workload management. Provides unmatched flexibility with an Intel architecture and open systems foundation

IBM xseries 430. Versatile, scalable workload management. Provides unmatched flexibility with an Intel architecture and open systems foundation Versatile, scalable workload management IBM xseries 430 With Intel technology at its core and support for multiple applications across multiple operating systems, the xseries 430 enables customers to run

More information

Mapping by recurrence and modelling the mutation rate

Mapping by recurrence and modelling the mutation rate Current knowledge is from apping by recurrence and modelling the mutation rate Shamil Sunyaev Broad Institute of.i.t. and Harvard Comparative genomics Experimental systems: yeast reporter assays Potential

More information

Nature Genetics: doi: /ng Supplementary Figure 1

Nature Genetics: doi: /ng Supplementary Figure 1 Supplementary Figure 1 Processing of mutations and generation of simulated controls. On the left, a diagram illustrates the manner in which covariate-matched simulated mutations were obtained, filtered

More information

Lecture 6 Software Quality Measurements

Lecture 6 Software Quality Measurements Lecture 6 Software Quality Measurements Some materials are based on Fenton s book Copyright Yijun Yu, 2005 Last lecture and tutorial Software Refactoring We showed the use of refactoring techniques on

More information

100 Million Subscriber Performance Test Whitepaper:

100 Million Subscriber Performance Test Whitepaper: An Oracle White Paper April 2011 100 Million Subscriber Performance Test Whitepaper: Oracle Communications Billing and Revenue Management 7.4 and Oracle Exadata Database Machine X2-8 Oracle Communications

More information

Sequence Analysis Lab Protocol

Sequence Analysis Lab Protocol Sequence Analysis Lab Protocol You will need this handout of instructions The sequence of your plasmid from the ABI The Accession number for Lambda DNA J02459 The Accession number for puc 18 is L09136

More information

Alignment to a database. November 3, 2016

Alignment to a database. November 3, 2016 Alignment to a database November 3, 2016 How do you create a database? 1982 GenBank (at LANL, 2000 sequences) 1988 A way to search GenBank (FASTA) Genome Project 1982 GenBank (at LANL, 2000 sequences)

More information