USING HPC CLASS INFRASTRUCTURE FOR HIGH THROUGHPUT COMPUTING IN GENOMICS

Similar documents
'Bioinformatics in academia as related to ehealth' - including the "Genomic Virtual Lab"

NextSeq 500 System WGS Solution

SciLifeLab Bioinformatics Platform National Bioinformatics Infrastructure Sweden (NBIS)

Towards Seamless Integration of Data Analytics into Existing HPC Infrastructures

Why would you NOT use public clouds for your big data & compute workloads?

FRAUNHOFER INSTITUTE FOR INTERFACIAL ENGINEERING AND BIOTECHNOLOGY IGB NEXT-GENERATION SEQUENCING. From wet lab to dry lab complete sample analysis

G E N OM I C S S E RV I C ES

Bill Magro, Ph.D. Intel Fellow, Software and Services Group Chief Technologist, High Performance Computing

How much sequencing do I need? Emily Crisovan Genomics Core September 26, 2018

How much sequencing do I need? Emily Crisovan Genomics Core

Are Next-Generation HPC Systems Ready for Population-level Genomics Data Analytics?

Course Presentation. Ignacio Medina Presentation

Accelerating Your Big Data Analytics. Jeff Healey, Director Product Marketing, HPE Vertica

Experimental Design Microbial Sequencing

A Storage Outlook for Energy Sciences:


From Lab Bench to Supercomputer: Advanced Life Sciences Computing. John Fonner, PhD Life Sciences Computing

Cray Earth Sciences Update. Phil Brown Earth Sciences Segment Leader icas, 17 th September 2015, Annecy

Korilog. high-performance sequence similarity search tool & integration with KNIME platform. Patrick Durand, PhD, CEO. BIOINFORMATICS Solutions

The Modular Supercomputer Architecture and its application in HPC and HPDA

Accelerate precision medicine with Microsoft Genomics

Accelerating Precision Medicine with High Performance Computing Clusters

HPC IN THE CLOUD Sean O Brien, Jeffrey Gumpf, Brian Kucic and Michael Senizaiz

LARGE DATA AND BIOMEDICAL COMPUTATIONAL PIPELINES FOR COMPLEX DISEASES

Addressing the I/O bottleneck of HPC workloads. Professor Mark Parsons NEXTGenIO Project Chairman Director, EPCC

HPC with Hybrid Clouds for Critical Big Compute workloads. by Denis Caromel, CEO & Founder, ActiveEon

ngs metagenomics target variation amplicon bioinformatics diagnostics dna trio indel high-throughput gene structural variation ChIP-seq mendelian

ATEA YOUR STRATEGIC PARTNER FOR MICROSOFT EDUCATON

Lecture 8: Predicting and analyzing metagenomic composition from 16S survey data

NGS part 2: applications. Tobias Österlund

TREE CODE PRODUCT BROCHURE

Introduction to Microbial Sequencing

IBM Power Systems. Bringing Choice and Differentiation to Linux Infrastructure

On Cloud Computational Models and the Heterogeneity Challenge

HiSeqTM 2000 Sequencing System

Genomic Technologies. Michael Schatz. Feb 1, 2018 Lecture 2: Applied Comparative Genomics

HPC Solutions. Marc Mendez-Bermond

Shannon pipeline plug-in: For human mrna splicing mutations CLC bio Genomics Workbench plug-in CLC bio Genomics Server plug-in Features and Benefits

Research Report. The Major Difference Between IBM s LinuxONE and x86 Linux Servers

CityEyes Appliance. Overview. CityEyes Appliances (CEA) Enterprise Video Mining and Video Management Solution. Simple Configuration and Management

SYMPOSIUM March 22-23, 2018

CHIA CENTER FOR HEALTH INFORMATION AND ANALYTICS RESEARCH, SOFTWARE, AND SERVICES

Lecture 8: Predicting metagenomic composition from 16S survey data

CFD Workflows + OS CLOUD with OW2 ProActive

NOW GENERATION SEQUENCING. Monday, December 5, 11

IBM Accelerating Technical Computing

Spectrum Scale Performance Tools Deployment at Nuance

Microsoft Azure Açık Kaynak Teknoloji Çözümleri

Case of predictive maintenance by analysis of acoustic data in an industrial environment

Achieving Agility and Flexibility in Big Data Analytics with the Urika -GX Agile Analytics Platform

resequencing storage SNP ncrna metagenomics private trio de novo exome ncrna RNA DNA bioinformatics RNA-seq comparative genomics

MS Integrating On-Premises Core Infrastructure with Microsoft Azure

1 Copyright 2011, Oracle and/or its affiliates. All rights reserved.

Integrating MATLAB Analytics into Enterprise Applications

Big data and HPC on-demand: Large-scale genome analysis on Helix Nebula the Science Cloud

SDSC 2013 Summer Institute Discover Big Data

Next Generation Bioinformatics on the Cloud

Real-time IoT Big Data-in-Motion Analytics Case Study: Managing Millions of Devices at Country-Scale

DNA-Based Digital Storage

Introduction to iplant Collaborative Jinyu Yang Bioinformatics and Mathematical Biosciences Lab

Advancing the adoption and implementation of HPC solutions. David Detweiler

Next Generation Sequencing. Tobias Österlund

Genopole Summer School Bioinformatic and biostatistic tools in medical genomics

Bringing AI Into Your Existing HPC Environment,

The IBM Reference Architecture for Healthcare and Life Sciences

Applications of Next Generation Sequencing in Metagenomics Studies

Yale Center for Research Computing. High Performance Computing Cluster Usage Guidelines

RNA sequencing with the MinION at Genoscope

ALCF SITE UPDATE IXPUG 2018 MIDDLE EAST MEETING. DAVID E. MARTIN Manager, Industry Partnerships and Outreach. MARK FAHEY Director of Operations

Nuts & Bolts of Informatics Infrastructure for Clinical Next Generation Sequencing (NGS)

ST 591: Introduction to Quantitative Genomics Syllabus

Getting Started with vrealize Operations First Published On: Last Updated On:

The 150+ Tomato Genome (re-)sequence Project; Lessons Learned and Potential

Deep Learning Acceleration with

High Performance Computing Workflow for Protein Functional Annotation

Parallels Virtuozzo Containers

Chapter 2: Access to Information

Introduction. Highlights. Prepare Library Sequence Analyze Data

InfoSphere DataStage Grid Solution

Creating an Enterprise-class Hadoop Platform Joey Jablonski Practice Director, Analytic Services DataDirect Networks, Inc. (DDN)

Lustre Beyond HPC. Toward Novel Use Cases for Lustre? Presented at LUG /03/31. Robert Triendl, DataDirect Networks, Inc.

Assembling a Cassava Transcriptome using Galaxy on a High Performance Computing Cluster

Big Data in Cloud. 堵俊平 Apache Hadoop Committer Staff Engineer, VMware

Microsoft FastTrack For Azure Service Level Description

Genomics. Data Analysis & Visualization. Camilo Valdes

ORACLE DATABASE PERFORMANCE: VMWARE CLOUD ON AWS PERFORMANCE STUDY JULY 2018

Oracle Cloud for the Enterprise John Mishriky Director, NAS Strategy & Business Development

UCSC Genomics Institute Current Practices & Challenges of Genomics in the Cloud

Applicazioni Cloud native

ECLIPSE 2012 Performance Benchmark and Profiling. August 2012

Digital Transformation & olvency II Simulations for L&G: Optimizing, Accelerating and Migrating to the Cloud

SAP Cloud Platform Pricing and Packages

ANY SURVEILLANCE, ANYWHERE, ANYTIME DDN Storage Powers Next Generation Video Surveillance Infrastructure

ILLUMINA SEQUENCING SYSTEMS

The Expanded Illumina Sequencing Portfolio New Sample Prep Solutions and Workflow

Contact us for more information and a quotation

A2L2: an Application Aware Flexible HPC Scheduling Model for Low-Latency Allocation

Ask the right question, regardless of scale

Features and Capabilities. Assess.

ACCELERATING GENOMIC ANALYSIS ON THE CLOUD. Enabling the PanCancer Analysis of Whole Genomes (PCAWG) consortia to analyze thousands of genomes

Transcription:

USING HPC CLASS INFRASTRUCTURE FOR HIGH THROUGHPUT COMPUTING IN GENOMICS Claude SCARPELLI Claude.Scarpelli@cea.fr FUNDAMENTAL RESEARCH DIVISION GENOMIC INSTITUTE Intel DDN Life Science Field Day Heidelberg, Oct 5th 2016 SCIENTIFIC COMPUTING LABORATORY PAGE 1

INSTITUTE OF GENOMICS De-novo sequencing Produce and analyze large amounts of genomic sequence data from various origins (human, plants, bacteria ) Develop bioinformatics tools for genome sequence analysis, annotation and exploration Exploration of prokaryotic biological and biochemical diversity, for chemical and environmental applications Sequencing multiples individuals from the same specie Whole genome association studies, pangenomic expression profiling, epigenetic studies, exome and whole genome sequencing Research on the genetic of human diseases through internal and collaborative research programs Searching for interaction between gene and environment Precision / personalized medicine Service for the academic community, through competitive call for projects. Co-founded collaborative projects In-house research projects : cross-fertilization between research and production PAGE 2

SEQUENCING INSTRUMENTS 5 HiSeq 2500 3 HiSeq 4000 1 HiSeqX5 Sequencing for biodiversity projects De-novo sequencing Metagenomic Human exome Gene transcription (RNAseq) CHiPseq / epigenetic Whole genome sequencing Up to 9,000 whole genome / year Third generation : single molecule, long reads, small footprint PAGE 3

FRANCE GENOMIQUE A distributed infrastructure for genomic analysis and bioinformatic, publicly funded by the French government, 8 sequencing facilities, including the two national centers, 15 bioinformatics laboratories, A dedicated part of an HPC class system, ran by CEA HPC center (4.000 cores, 3 large memory systems (3 TB), 5 PB storage (HSM) Objectives : sharing and leveraging expertise, providing high capacity resources for highly competitive projects, Technologies assessment (long reads, optical mapping ) methodology for sequencing and bioinformatics : assessment and development in relation to selected projects 9 work packages staffed with 45 temporary contracts for technology assessment and innovative developments in collaboration with the French Institute for Bioinformatics (IFB) Dedicated funding for large scale and high impact projects (direct costs), mainly on biodiversity and human genomic PAGE 4

GENOMIC VS CFD OR HEP FROM THE HPC CENTER PERSPECTIVE Genomic is definitely in the Big Data arena Big Data: Astronomical or Genomical?, Stephens ZD, Lee SY, Faghri F, Campbell RH, Zhai C, Efron MJ, et al. (2015), PLoS Biol 13(7): e1002195. doi:10.1371/journal.pbio.1002195, July 7th 2015 Data dynamic in Genomic : large raw data that expanse while processed (not true in HEP) Summary : genomic = Big Data (3(+) V (Volume, Velocity, variability )) + distributed computing Need to efficiently feed hundreds if not thousands of CPU with living data PAGE 5

SPECIFICATIONS FOR GENOMIC IN HPC Prerequisites for High Throughput Genomic: Engineering, facilities : power, weight, cooling, space Distributed computing, security, high performance IO Strategic partnership between the Genomic Institute and HPC teams, in order to take into account genomics' specific needs: Medium and long term storage of living data, Data parallelism (easy for an HPC center), genome and metagenome assembly, classification, counting: large memory machine, long run time, Almost unpredictable execution time, Package availability, support for multiple concurrent versions, Groupware work mode Next move: databases, data analytic Data generation (genomic center) Data storage and processing (HPC center)) Data presentation and exploration (genomic center) PAGE 6

TGCC S ARCHITECTURE CEA operates several HPC center, for Defense as well as for civilian applications 4 systems in the 1-100th rank of Top500 Expertise is shared among the systems Ready for exascale computing Data centric model: civilian system shares the same storage GL-TGCC 7.5 PB disk ST-TGCC Level 1: 1 PB disk Level 2: 30 PB tapes Cobalt: 38.000 cores, 1.5 Pflops, 539 kw, 63th Top500 Curie, thin nodes: 78.000 cores, 1.6 Pflops, PAGE 7

SOFTWARE INSTALLATION «standard» Linux system, Slurm resources manager «Modules» software management system, to deal with multiple versions, 220+ software installed (genomic field), including many that depends on interpreters (Python, Perl, R ) whose versions may be incompatible «Modules» also used for users directory selection (one user / multiple projects) $ module load extenv/fg (loads FG extension, ie modules shown above) $ module load dfldatadir/fgxxxx (loads FG specific variables, for easier access to per-project data directory) $ module avail $ module load samtools (loads default version of samtools) $ module load vcftools/0.1.12 (loads a specific version of samtools) PAGE 8

PASSTHRU FEATURE Main goal : to keep (large) files at only one place Web site (and authentication) located in users premise Read only access Technical details : reverse proxy PAGE 9

DAILY WORKLOAD Limit as much as possible data transfer : data are either stored locally or at the HPC premises, but not both Accommodates production as well as research/exploration workflow. Though not clinical-ready Daily genotyping center production for WES and WGS run at the HPC premises, de-novo sequencing (assembly) for genomes and metagenome (including eukaryotic) Example : Tara-Oceans projects (eukaryotic marine metagenomic and metatranscriptomic, 200 stations, 3 depth, 4 filter sizes), set up of a gene set of 180 millions entries, genomes and transcripts mapping (species abundance and functional activity). PAGE 10

FUTUR FOR THE HPC CENTER OS based checkpoint and restart feature for application that lacks this and have long runtime OS on demand, including Windows. With full data access. Enlarge list of monitored and bookable resources (memory, IO...). Better restitution time guarantees. From High Performance Computing to High Throughput Computing and Cloud features (on-demand, elasticity) Cloud computing compute Data sharing VM online Ondemand computing HPC PAGE 11

THANK YOU FOR YOUR ATTENTION. QUESTIONS? Présentation TGCC / CCRT 9 décembre 2011 12 12