High End Computing Context for Genomes to Life

Similar documents
Plant genome annotation using bioinformatics

Introduction to Bioinformatics

Drug Discovery on Cluster Farms

The Computational Impact of Genomics on Biotechnology R&D (sort of )

BIOINFORMATICS Introduction

Delivering High Performance for Financial Models and Risk Analytics

High peformance computing infrastructure for bioinformatics

Bioinformatics and computational tools

Following text taken from Suresh Kumar. Bioinformatics Web - Comprehensive educational resource on Bioinformatics. 6th May.2005

HTCaaS: Leveraging Distributed Supercomputing Infrastructures for Large- Scale Scientific Computing

LARGE DATA AND BIOMEDICAL COMPUTATIONAL PIPELINES FOR COMPLEX DISEASES

From Bench to Bedside: Role of Informatics. Nagasuma Chandra Indian Institute of Science Bangalore

High Performance Computing(HPC) & Software Stack

IBM Accelerating Technical Computing

Biology 644: Bioinformatics

MARINE BIOINFORMATICS & NANOBIOTECHNOLOGY - PBBT305

High-Performance Computing (HPC) Up-close

Overview of Health Informatics. ITI BMI-Dept

This place covers: Methods or systems for genetic or protein-related data processing in computational molecular biology.

Study on the Application of Data Mining in Bioinformatics. Mingyang Yuan

BIOINFORMATICS THE MACHINE LEARNING APPROACH

UAB Condor Pilot UAB IT Research Comptuing June 2012

NWChem Performance Benchmark and Profiling. October 2010

Engineering Genetic Circuits

NX Nastran performance

IBM xseries 430. Versatile, scalable workload management. Provides unmatched flexibility with an Intel architecture and open systems foundation

DEPEI QIAN. HPC Development in China: A Brief Review and Prospect

HPC Market Update. Peter Ungaro. Vice President Sales and Marketing. Cray Proprietary

Improve Your Productivity with Simplified Cluster Management. Copyright 2010 Platform Computing Corporation. All Rights Reserved.

Information Driven Biomedicine. Prof. Santosh K. Mishra Executive Director, BII CIAPR IV Shanghai, May

Optimization of the Selected Quantum Codes on the Cray X1

Computational Biology

Reverse-Engineering the Brain

ECLIPSE 2012 Performance Benchmark and Profiling. August 2012

Introduction to Bioinformatics

HRG Insight: IBM & Intel: Intelligent Choice for Life Sciences

Accelerating Motif Finding in DNA Sequences with Multicore CPUs

IBM HPC DIRECTIONS. Dr Don Grice. ECMWF Workshop October 31, IBM Corporation

Performance Optimization on the Next Generation of Supercomputers

Drug candidate prediction using machine learning techniques

BIOINFORMATICS AND SYSTEM BIOLOGY (INTERNATIONAL PROGRAM)

CMSE 520 BIOMOLECULAR STRUCTURE, FUNCTION AND DYNAMICS

ROAD TO STATISTICAL BIOINFORMATICS CHALLENGE 1: MULTIPLE-COMPARISONS ISSUE

Powering Artificial Intelligence with High-Performance Computing

STUDYING THE TRANSPORT OF POLLUTANTS IN THE ATMOSPHERE USING AN IBM BLUE GENE/P COMPUTER

2/23/16. Protein-Protein Interactions. Protein Interactions. Protein-Protein Interactions: The Interactome

Bioinformatics and Life Sciences Standards and Programming for Heterogeneous Architectures

ANSYS FLUENT Performance Benchmark and Profiling. October 2009

Inferring Gene Networks from Microarray Data using a Hybrid GA p.1

Simulating Biological Systems Protein-ligand docking

Our view on cdna chip analysis from engineering informatics standpoint

ECS 234: Introduction to Computational Functional Genomics ECS 234

Products & Services. Customers. Example Customers & Collaborators

Advancing the adoption and implementation of HPC solutions. David Detweiler

Introduction to BIOINFORMATICS

ECS 234: Introduction to Computational Functional Genomics ECS 234

The Integrated Biomedical Sciences Graduate Program

HST MEMP TQE Concentration Areas ~ Approved Subjects Grid

"Large-Scale Information Extraction for Biomedical Modelling and Simulation"

HPC Cloud Interactive User support. Floris Sluiter Project leader SARA computing & networking services

Introduction to BIOINFORMATICS

Buffalo Center of Excellence in Bioinformatics

ANSYS HPC ANSYS, Inc. November 25, 2014

Powering Disruptive Technologies with High-Performance Computing

Bayer Pharma s High Tech Platform integrates technology experts worldwide establishing one of the leading drug discovery research platforms

July, 10 th From exotics to vanillas with GPU Murex 2014

Genetics and Bioinformatics

Resisting obsolescence

The Manycore Shift. Microsoft Parallel Computing Initiative Ushers Computing into the Next Era

Biomedical Sciences Graduate Program

Introduction to 'Omics and Bioinformatics

CSC 121 Computers and Scientific Thinking

Bioinformatics 2. Lecture 1

WHEN PERFORMANCE MATTERS: EXTENDING THE COMFORT ZONE: DAVIDE. Fabrizio Magugliani Siena, May 16 th, 2017

<Insert Picture Here> Oracle Exalogic Elastic Cloud: Revolutionizing the Datacenter

GREG GIBSON SPENCER V. MUSE

Lecture 1. Bioinformatics 2. About me... The class (2009) Course Outcomes. What do I think you know?

Genomics Market Share, Size, Analysis, Growth, Trends and Forecasts to 2024 Hexa Research

Optimizing Outcomes in a Connected World: Turning information into insights

Complementary Elective Courses

Data representation for clinical data and metadata

locuz.com Professional Services Locuz HPC Services

Discoveries Through the Computational Microscope

The Commercial Use of Biodiversity: Access and Benefit Sharing in Practice.

Computational Challenges of Medical Genomics

A Model-Driven Architecture For Unifying, Powering and Accelerating Drug Discovery. Secant Technologies:

Learning theory: SLT what is it? Parametric statistics small number of parameters appropriate to small amounts of data

ccelerating Linpack with CUDA heterogeneous clusters Massimiliano Fatica

High Performance Computing & Analytics... De-Mystifying Cognitive: HPC/HPDA/AI Georgia Power User Group Atlanta, GA, 20 April 2017

Instructor: Jianying Zhang, M.D., Ph.D.; Office: B301, Lab: B323; Phone: (O); (L)

In silico prediction of novel therapeutic targets using gene disease association data

Semester : Thursday Friday. Subject. Subject. CE0308 Environmental. Foundation Engineering. CI Hydraulic.

Program overview. SciLifeLab - a short introduction. Advanced Light Microscopy. Affinity Proteomics. Bioinformatics.

ABSTRACT COMPUTER EVOLUTION OF GENE CIRCUITS FOR CELL- EMBEDDED COMPUTATION, BIOTECHNOLOGY AND AS A MODEL FOR EVOLUTIONARY COMPUTATION

HPC Market Update: Second Quarter. Copyright 2011 IDC. Reproduction is forbidden unless authorized. All rights reserved.

Biomedical Data Science

Extreme Performance computing at Sun/Oracle

Role of Bio-informatics in Molecular Medicine

IBM Grid Offering for Analytics Acceleration: Customer Insight in Banking

Jack Weast. Principal Engineer, Chief Systems Engineer. Automated Driving Group, Intel

Transcription:

High End Computing Context for Genomes to Life William J Camp, PhD Director Computers, Computation, and Math Center January 22, 2002 1

What s Happening with High End Computing? High-volume building blocks available Commodity trends reduce cost Assembling large clusters is easier than ever Incremental growth possible HPC market is small and shrinking (relatively) Performance of high-volume systems increased dramatically Web driving market for scalable clusters Hot market for high-performance interconnects (Commodity, Commodity, Commodity) 2

Commodity Value Propositions Price High-Performance Computing 1,000s units Enterprise servers 100,000s units PCs and Workstations 10,000,000s units Performance Best Price Performance Low cost to drive down the cost of simulation Flexible and adaptable to changing needs Manage and operate as a single distributed system High Volume Technologies Give Favorable Price-Performance 3

Cray-1 4

ncube-2 Massively Parallel Processor (1024) Figure 1: Two hypercubes of the same dimension, joined together, form a hypercube of the next dimension. N is the dimension of the hypercube. 5

Intel Paragon 1,890 compute nodes 3,680 i860 processors 143/184 GFLOPS 175 MB/sec network SUNMOS lightweight kernel 6

High Performance Computing at Sandia: Hardware & System Software ASCI Red: The World s First Teraflop Supercomputer 9,472 Pentium II processors 2.38/3.21 TFLOPS 7 400 MB/sec network Puma/Cougar lightweight kernel OS Cplant TM : The World s Largest Commodity Cluster ~2,500 Compaq Alpha Processors ~2.7 Teraflops Myrinet Interconnect Switches Linux-based OS (Portals)

Cplant Performance Relative to ASCI Red Intel TFLOPS Cplant 8

Molecular Dynamics Benchmark (LJ Liquid) Fixed problem size (32000 atoms) 9

Distributed & Parallel Systems Distributed systems Internet Legion\Globus Beowulf Berkley NOW Cplant ASCI Red Tflops Massively parallel systems heterogeneous homogeneous Gather (unused) resources Steal cycles System SW manages resources System SW adds value 10% - 20% overhead is OK Resources drive applications Time to completion is not critical Time-shared Bounded set of resources Apps grow to consume all cycles Application manages resources System SW gets in the way 5% overhead is maximum Apps drive purchase of equipment Real-time constraints Space-shared 10

Communication-Computation Balance for Past and Present Massively Parallel Supercomputers Machine Balance Factor (bytes/s/flops) Intel Paragon 1.8 Ncube 2 1 Cray T3E.8 ASCI Red.6 Cplant.1 But Does Balance Matter for Biology? 11

Biology A Field With Increasing Impact on High End Computing Why is it important to high-end computing? What effect is it having? 12

High-Throughput Experimental Techniques Are Revolutionizing Biological and Health Sciences Research DNA Sequencing Gene Expression Analysis With Microarrays Protein Profiling via High Throughput Mass Spectroscopy Protein-Protein Interactions Whole-Cell Response 13

The Ultimate Goal of Systems Biology Is Driving A Broad Range of Computing Requirements Bioinformatics: Accumulating data from high-throughput experiments followed by pattern discovery & matching. Molecular Biophysics & Chemistry Modeling Complex Systems (e.g. Cells) 14

Computing-for-the-Life Sciences: The Lay of the Land The Implications of the New Biology for High-End Computing are Growing IT market for life sciences forecast to reach $40B by 2004, e.g. Vertex Pharmaceuticals: 100+ Processor cluster, company featured in Economist article Celera builds 1 st tera-cluster for biotechnology speeds up genomics by 10x IBM, Compaq: $100 million investments in the life sciences market NuTec 7.5 Tflops IBM cluster (US, & Europe-planned) GeneProt Large-Scale Proteomic Discovery And Production Facility:1,420 Alpha processors. Blackstone Linux/Intel Clusters (Pfizer, Biogen, AstraZeneca, & 10-15 more on the way) 15

16 Computing For Life Sciences at the Terascale Consider One Example: In Silico Pharmaceutical Development? A Guess at the Computing Power Needed (in TeraOps) 1 10 100 1000 Sequence Genome Assemble Gene Finding Identification Annotate Gene to Protein Map Protein Protein Interaction Pathways Normal & Aberrant Function in pathway Structure Drug Targets Cellular Response Trivially Parallel Massively Parallel Bioinformatics Molecular Biophysics Complex Systems

Definitions Molecular Biophysics Biological molecular-scale physical and chemical challenges and phenomena (e.g. structural biology, docking, ion channel problem.) Bioinformatics Informatics challenges (I/O, data mining, pattern matching, parallel algorithms etc.) presented by high-throughput biology and opportunities of application of terascale computing Complex Systems Developing models (most likely computationally intensive) of complex biological systems (e.g. cells): Volume methods, Circuit Models (Biospice, Bio-Xyce)), stochastic dynamics (e.g. Mcell), informatics (e.g. AFCS) 5) unit op s/rule-based interactions, reductionism (as last resort) 17

Definitions Distributed Resource Management & Problem Solving Environments Developing understanding & collaborations to establish a role in developing the tools & environment to enable the use of terascale computing to produce rapid understanding of complex biological systems through the combined approaches of high-throughput experimental methods data parallel I/O system software meta OS simulation & complex system modeling integrated understanding People, data, software, hardware, algorithms distributed geographically, organizationally & institutionally The Bio-Grid 18

Some Examples (from Sandia s experience) 19

odor receptors heat shock VXInsight TM Analysis of Microarray Data oocyte cuticle G proteincoupled receptors ribosomal proteins actin, tubulin sperm sperm cuticle 20

GeneHunter Popular linkage analysis code from Whitehead Institute: does multipoint analysis (coupled marker effects) Computation/memory is exponential in size of pedigree, linear in number of markers. 15 person pedigree -> 24 bits -> vectors of length 2 24 CPU days on a workstation Ideal for explosion of genetic marker data (e.g. SNPs). Our project: a distributed-memory parallel GeneHunter. run on Cplant extend scale of run-able problems, both in memory and CPU 21

Gene Hunter: Parallel Performance Results 22

New Database Methods for Structure-Property Relationships QSAR Equation with Signature HIV-1 protease inhibitors binding affinities (pic50) 10 N tr = 137 11 10 pic50 (model) 9 8 7 6 pic50 (model) 9 8 7 6 5 4 4 5 6 PCA/MLR 16 var. (r MLR 76 var. (r pic50 (experiment) 7 8 2 = 0.841) 2 = 0.953) 9 10 Typical QSAR (Molconn-Z descriptors) 5 4 4 5 6 7 8 9 10 11 pic50 (exp) Signature descriptors (extended connectivity index) (H 3 N-C H 2 -COOHH) Glycine C= C N H H 23 O H O= H H H

Computational Molecular Biophysics Molecular Simulation Molecular Dynamics (MD) NVT, NVE Grand Canonical MD Reaction-ensemble MD Monte Carlo (MC) Grand Canonical MC Configurational Bias MC Gibbs-ensemble MC Transition state theory Molecular Theory Classical Density Functional Theory Electronic Structure Methods Local Density Approximation (LDA) Quantum Chemistry (HF etc.) Mixed Methods Quantum-MD (Car-Parinello) Quantum-CDFT Brownian Dynamics-CDFT 24

Ion Channel Model & Initial Results 5 (21 Å) exterior 3 =0.05 (1M) 1:1 elec 4 (17 Å) 5 (21 Å) 4 (17 Å) 4 (17 Å) 1.5 (6.4 Å) 5 (21 Å) interior 3 =0.005 (0.1M) 1:1 elec Current (pa) 60 40 20 0-20 -40 Calculated Ion Channel Current-Voltage Curve e/kt=3.21 ( =100 mv) qs 2/e=0.2 (0.01e/Å 2 ) (3 charges in 3D) e/kt=0-60 -300-200 -100 0 100 200 300 (mv) 25 Measured Ion Channel Current-Voltage Curve

Virtual Cell Project NCRR-funded center within the UConn Health Center, Center for Biomedical Imaging Technology National Resource for Cell Analysis & Modeling (Virtual Cell) http://www.nrcam.uchc.edu/ Sandia s Contributions efficient parallel implementation solving systems of stiff linear PDE s converting digitized images in 3d to meshable geometry 26

Virtual Cell Collaboration First Fully 3-D Simulation of the Ca 2+ Wave Transport and Reaction During Fertilization of the Xenopus Laevis Egg Meshed with Sandia s CUBIT technology and solved with Sandia s MP-SALSA massively parallel diffusion, transport, and reaction finite element code. 27

Meshing of Mitochondrial Cristae for 3-D ADP/ATP Transport Study Within Mitochondria 3d Mitochondria Cristae Geometry from Confocal Microscopy Computational Geometry 3d Hex Mesh for Finite Element Analysis 28

29 Another Possible Scenario Sequence Genome Assemble Gene Finding Identification Annotate Gene to Protein Map Protein Protein Interaction Pathways Normal & Aberrant Function in pathway Structure Drug Targets 1 10 100 100 1000 shortcut Bioinformatics

Terascale Supercomputers To Date 2 Sandia RedStorm 1.6 Cplant Balance Factor 1.2 0.8 0.4 0-0.4 ASCI Red 1.8 Tflops ASCI Red 3.2 Tflops 0 10 20 30 PSC & French ASCI Blue Pacific ASCI Blue Mtn Balance Factor = 1 ASCI White Teraflops ASCI Q $200M $25M 30

Big Pharma (& Biotech) Are Increasingly Driving The High-End Computing Market. Annual Sales (est.) $300B Historic Growth Rate 10-14% R&D Expenditures $62B ($26.4B in US) $4.6B external R&D to understand human genome 11% of sales in 1980 20% of sales in 2000 70% of patented drugs come off patent in the next 4 years. 80% average drop in sales revenue when patent expires. $600M average drug development cost. Diminishing pool of easy targets. 31

Computing-for-the-Life Sciences: The Lay of the Land The Implications of the New Biology for High-End Computing are Growing IT market for life sciences forecast to reach $40B by 2004, e.g. Vertex Pharmaceuticals: 100+ Processor cluster, company featured in Economist article Celera builds 1 st tera-cluster for biotechnology speeds up genomics by 10x IBM, Compaq: $100 million investments in the life sciences market NuTec 7.5 Tflops IBM cluster (US, & Europe-planned) GeneProt Large-Scale Proteomic Discovery And Production Facility:1,420 Alpha processors. Blackstone Linux/Intel Clusters (Pfizer, Biogen, AstraZeneca, & 10-15 more on the way) 32

Speeding up Informatics - Parallel BLAST framework Idea: Create a tool for running large parallel BLAST jobs (genome vs genome) on distributed memory cluster Problems with current approach: tens of 1000s of LSF jobs: can't control or schedule them, users don t know progress full database on every proc I/O is not managed Goals for parallel tool: scalable parallel performance minimize and measure memory/proc $$ minimize and measure time in I/O feedback & monitoring for users design framework for more than BLAST 33

Basic Idea Large query file & database = many strings. One file chunk per column, one per row. BLAST -> N 2 subsets of work to do => reduced memory per proc. Others have looked at this as well 34

Our Implementation of the Idea Master scheduler: runs on 1 proc breaks up sequence analysis problem into M x N pieces schedules slave jobs intelligently run on all procs BLAST (or other tools) pre- and post-processors launch slave job on particular proc messages back to master detect when slave is finished Slave codes: PVM or MPI: 35

Scheduler Intelligence Query and DB pre-processing have to be run before BLAST sub-job. Give a proc a new BLAST sub-job that doesn't require additional I/O. Slaves can write to local disk to minimize I/O. 4 procs in a ES40 box can share files & loaded memory (via UBC). 36

User monitoring Master writes status file on slave tasks JAVA app displays it: Timeline on job history on compute farm. Stats on parallel performance, load-balance, individual jobs & procs. Add instrumentation of BLAST for memory/cpu stats. 37

Advantages Reduced memory use per processor: each BLAST task uses small portion of database and query files can be as small as desired User feedback JAVA app. Parallel efficiency & scalability: fine granularity of BLAST sub-jobs -> load-balance overall throughput should be much greater and not dependent on high degree of user sophistication Reduced I/O: local vs NFS disks sqrt(p) effect 38

39 There May Be Other Ways to Attack This Problem: Current Timeline? 1 10 100 1000 Sequence Genome Assemble Gene Finding Identification Annotate Gene to Protein Map Protein Protein Interaction Pathways Normal & Aberrant Function in pathway Structure Drug Targets Today 2020--2030?

40 High-Throughput Experiments May Create a Shortcut SNPs disease association from clinical and genetic history Existing clinical data and tissue banks could be CRITICAL because Sequence Genome Assemble Gene Finding Identification Annotate Gene to Protein Map Protein Protein Interaction Pathways Normal & Aberrant Function in pathway Structure Drug Targets 1 10 100 100 1000 shortcut

41 Computing Power Needs May Decrease with Clinical Collaborations 1 10 100 100 1000 Sequence Genome Assemble Gene Finding Identification Annotate Gene to Protein Map Protein Protein Interaction Pathways Normal & Aberrant Function in pathway Structure Drug Targets shortcut Today 2005--2015?

Summary There are bigger problems in GtL and similar efforts than any computer can handle now or in the next 10 years However, we need to develop the computational tools and Frameworks that take us from genomics to proteomics and beyond to Whole cell/ whole organism response This will require a uniquely challenging juxtaposition of high-throughput experimental methods grid-based bio-informatics sophisticated simulation of info and energy flows, of structure and function, and of environmental reactions 42