The Seven Bridges Cloud Ecosystem: Enabling Interoperable Data Access and Analysis. Liz Williams, PhD


1 The Seven Bridges Cloud Ecosystem: Enabling Interoperable Data Access and Analysis. Liz Williams, PhD. 2018 Seven Bridges, sevenbridges.com

2 The content of this presentation is solely the responsibility of Seven Bridges Genomics Inc. and does not necessarily represent the official views of the National Cancer Institute or the National Institutes of Health.

3 The Seven Bridges Cloud Ecosystem Enables Precision Medicine: Data; Infrastructure; Interoperability; Partnerships; Users

4 The Seven Bridges Cloud Ecosystem: Infrastructure; Interoperability; Partnerships

5 The Seven Bridges Platform: Web Application; API; Data Infrastructure; Task Execution API; Core Platform. Infrastructure-Independent Core Services: Data/Metadata Service; Project Management; User Management; Authentication & Authorization; System Monitoring; Usage Logging; Notification Service; Backup Service; Billing Management. Task Execution Infrastructure: Task Scheduler; Job Management Layer; Orchestration Layer; Cloud Storage & Compute Resource Manager

6 Security & Compliance on the Seven Bridges Platform: HIPAA-compliant on AWS and GCP deployments; ISO 27001:2013 certified; US Federal Information Security Management Act (FISMA) Moderate certification based on NIST Rev 4 controls for the CGC; NIH Trusted Partner for the CGC; Compliant with dbGaP Security Best Practices; US-EU Privacy Shield Program registered participant, preparing for GDPR; Support for CAP, CLIA, and GxP best practices

7 Essential Features of an Interoperable Data Ecosystem: Findable, Accessible, Interoperable, Reusable + Collaborative, Usable, Reproducible, Extendable, Scalable

8 Essential Features of an Interoperable Data Ecosystem: Collaborative, Usable, Reproducible, Extendable, Scalable. Secure, customizable workspaces; Managed billing; User-friendly interface; Easy data management; Industry-standard bioinformatics pipelines; Flexible & reproducible methods; Automated & accessible task logs; Developer-friendly tools; Portable bioinformatics pipelines; Scalable data storage; Cloud-optimized computation

9 Growth of the Seven Bridges Cloud Ecosystem: TCGA Pilot Program announced; CAVATICA selected as NIH Kids First Data Resource; Awarded NCI Cancer Genomics Cloud (CGC) Pilot contract; Partnered with JAX to build NCI's PDXNet Data Commons; Launched CAVATICA partnership with CHOP; Selected for NIH Data Commons Pilot; 2018; Launched the CGC; Launched CAVATICA; Registered 1000th CGC user; Logged 3000th user & 450th year of compute time on the CGC

10 Data in the Seven Bridges Cloud Ecosystem [figure] * Available by Q

11 The Seven Bridges Cancer Genomics Cloud (CGC): An NCI Cancer Research Data Commons Cloud Resource

12 The Seven Bridges CGC: A Cloud Resource within the NCI Cancer Research Data Commons for secure storage, sharing & analysis of petabytes of public, multi-omic cancer datasets. The Seven Bridges Cancer Genomics Cloud has been funded in whole or in part with Federal funds from the National Cancer Institute, National Institutes of Health, Contract No. HHSN C and Task Order No. 17X146 under Contract No. HHSN I. cancergenomicscloud.org

13 Accessibility: User-friendly web interface; Powerful RESTful API, Datasets API & object-oriented, user-friendly libraries in Python, R & Java; Comprehensive online documentation & training resources; Technical support from a team of expert scientists, bioinformaticians & engineers
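
As a minimal sketch of what a call to the RESTful API might look like from Python, the snippet below builds an authenticated GET request with the standard library and deliberately does not send it. The base URL, the `X-SBG-Auth-Token` header, and the `projects` path are assumptions drawn from the public CGC API documentation rather than from this slide, and `build_request` is a hypothetical helper, not part of any Seven Bridges library.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Assumed base URL for the CGC API; check the platform documentation.
API_BASE = "https://cgc-api.sbgenomics.com/v2"


def build_request(path, token, params=None):
    """Build (but do not send) an authenticated GET request.

    `path` is relative to API_BASE, `token` is a user's authentication
    token, and `params` is an optional dict of query parameters.
    """
    query = f"?{urlencode(params)}" if params else ""
    return Request(
        f"{API_BASE}/{path.lstrip('/')}{query}",
        # Assumed header name for token-based authentication.
        headers={"X-SBG-Auth-Token": token, "Accept": "application/json"},
        method="GET",
    )


# Example: a request that would list up to 10 projects for this token.
req = build_request("projects", token="MY_TOKEN", params={"limit": 10})
print(req.get_method(), req.full_url)
```

Passing `req` to `urllib.request.urlopen` would send the request; the object-oriented client libraries mentioned on the slide wrap this kind of call.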

14 Collaboration Tools: Secure and customizable private workspaces for management of collaborators, data, tools & analysis results; Project description, note & notification features for communicating with collaborators around the world; Automatically generated, durable records of input/output files, apps, versions & parameters for every task run on the platform

15 Petabytes of Public Datasets: 3 PB of multi-omic public datasets; 20 PB of linked data; 0.5 PB of private & derived data. * Anticipated availability

16 Interactive and Programmatic Query Tools: Web- and API-based metadata query tools to explore the data landscape and build cohorts for analysis; Semantic triple-store technology for dataset harmonization & cross-dataset query building
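
To illustrate the idea behind cohort building over a triple store, here is a toy sketch in which metadata is held as (subject, predicate, object) triples and a cohort is assembled by chaining pattern matches. Everything in it is hypothetical: the identifiers, the predicate names, and the `match` and `cohort_files` helpers are invented for illustration and say nothing about how the platform's query tools are implemented.

```python
# Toy in-memory triple store; all identifiers and predicates are hypothetical.
TRIPLES = {
    ("case-1", "disease", "lung adenocarcinoma"),
    ("case-2", "disease", "lung adenocarcinoma"),
    ("case-3", "disease", "breast carcinoma"),
    ("case-1", "has_file", "file-a"),
    ("case-2", "has_file", "file-b"),
    ("case-3", "has_file", "file-c"),
    ("file-a", "experimental_strategy", "RNA-Seq"),
    ("file-b", "experimental_strategy", "WGS"),
    ("file-c", "experimental_strategy", "RNA-Seq"),
}


def match(subject=None, predicate=None, obj=None):
    """Return all triples matching the pattern; None acts as a wildcard."""
    pattern = (subject, predicate, obj)
    return [
        triple for triple in TRIPLES
        if all(p is None or p == v for p, v in zip(pattern, triple))
    ]


def cohort_files(disease, strategy):
    """Files of a given experimental strategy from cases with a given disease."""
    cases = {s for s, _, _ in match(predicate="disease", obj=disease)}
    files = {o for s, _, o in match(predicate="has_file") if s in cases}
    return sorted(
        f for f in files if match(f, "experimental_strategy", strategy)
    )


result = cohort_files("lung adenocarcinoma", "RNA-Seq")
print(result)  # ['file-a']
```

Because every fact has the same three-part shape, triples from different datasets can sit in one store and be queried together once their predicates are harmonized.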

17 Built-in Data Security: Per-file, per-user permissions management for third-party controlled-access data; A permissions management model extendable across datasets & data governance entities
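
A per-file, per-user check of this kind can be sketched as follows, assuming a simplified model in which each file belongs to one dataset, open files are readable by anyone, and controlled files require a dataset-level approval for the user. The `FileRecord` and `AccessPolicy` classes, the dataset names, and the user names are all hypothetical and are not the platform's actual permission model.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass(frozen=True)
class FileRecord:
    file_id: str
    dataset: str       # governing dataset (hypothetical names below)
    controlled: bool   # True if access requires an approval


@dataclass
class AccessPolicy:
    # Maps each user to the set of datasets they are approved for.
    approvals: Dict[str, Set[str]] = field(default_factory=dict)

    def approve(self, user: str, dataset: str) -> None:
        self.approvals.setdefault(user, set()).add(dataset)

    def can_read(self, user: str, record: FileRecord) -> bool:
        # Open files are readable by everyone; controlled files need approval.
        if not record.controlled:
            return True
        return record.dataset in self.approvals.get(user, set())

    def visible_files(self, user: str, records: List[FileRecord]) -> List[str]:
        return [r.file_id for r in records if self.can_read(user, r)]


files = [
    FileRecord("open-1", "public-set", controlled=False),
    FileRecord("ctrl-1", "study-A", controlled=True),
    FileRecord("ctrl-2", "study-B", controlled=True),
]
policy = AccessPolicy()
policy.approve("alice", "study-A")

print(policy.visible_files("alice", files))  # ['open-1', 'ctrl-1']
print(policy.visible_files("bob", files))    # ['open-1']
```

Adding a new dataset or governance entity in this sketch only adds entries to `approvals`, which is one way a model could extend across datasets without changing the check itself.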

18 Tools To Connect Data. Import Data to the Platform: Command Line Uploader & CLI; Seven Bridges Uploader (GUI); API import; HTTP(S) / FTP import. Connect the Platform to External Resources: Mount projects from your desktop; Connect Cloud Storage (Volumes API); SBFS (a FUSE-based file system)

19 Tools To Analyze Data: A curated collection of bioinformatics tools & workflows; Optimized for speed & cost in the cloud; Fully parameterized & customizable; Accessible via the GUI & API

20 Tools To Ensure Analytical Reproducibility: Docker-containerized bioinformatics pipelines; Automatically generated and accessible logs for every task run on the platform: tool & workflow versions, parameters, input & output files
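
One way to picture such a per-task record is as a small structured document that captures the tool version, container image, parameters, and input and output files, plus a fingerprint that is identical whenever those details are identical. The sketch below is an assumption-laden illustration: the field names, the `task_record` helper, and the example tool and image names are hypothetical and do not describe the platform's log format.

```python
import hashlib
import json


def task_record(app, app_version, image, parameters, inputs, outputs):
    """Build a reproducibility record with a deterministic fingerprint."""
    record = {
        "app": app,
        "app_version": app_version,
        "docker_image": image,
        "parameters": parameters,
        "inputs": sorted(inputs),
        "outputs": sorted(outputs),
    }
    # Canonical JSON (sorted keys, fixed separators) so that the same
    # content always hashes to the same value regardless of dict order.
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    record["fingerprint"] = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return record


# Hypothetical tool, image, and file names, for illustration only.
run_a = task_record(
    "example-aligner", "1.2.0", "example/aligner:1.2.0",
    {"threads": 8, "seed": 42}, ["sample_R1.fastq", "sample_R2.fastq"],
    ["sample.bam"],
)
run_b = task_record(
    "example-aligner", "1.2.0", "example/aligner:1.2.0",
    {"seed": 42, "threads": 8}, ["sample_R2.fastq", "sample_R1.fastq"],
    ["sample.bam"],
)
run_c = task_record(
    "example-aligner", "1.3.0", "example/aligner:1.3.0",
    {"threads": 8, "seed": 42}, ["sample_R1.fastq", "sample_R2.fastq"],
    ["sample.bam"],
)

print(run_a["fingerprint"] == run_b["fingerprint"])  # True: same run details
print(run_a["fingerprint"] == run_c["fingerprint"])  # False: version changed
```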

21 An Extendable Analysis Ecosystem: SBFS to connect data on the platform to local applications; Data Cruncher, a custom JupyterLab environment for interactive analysis, data visualization & implementation of custom tertiary analysis tools. [Figure labels: Files, Instance]

22 Tools To Port Your Own Pipelines to the Platform: An intuitive and flexible software development kit for developing and porting custom tools to the platform; Conformance with community standards to ensure pipeline portability & reproducibility

23 Value of the CGC Ecosystem to the Research Community: ,000+ registered users from 60+ countries; 347,000+ completed tasks representing years of total compute time. [Chart: Number of Tasks Run (Completed Tasks, Failed Tasks), Jan 2016 to Jan 2018]

24 The CGC Enables Scalable, Cost-Effective Research. Case Study #1: TCGA Immune Response Working Group. Collaborative analysis with members of the Immune Response Working Group of The Cancer Genome Atlas (TCGA) Research Network. Outcome: cost-optimized (<$0.30/sample), high-throughput HLA typing across ~9,000 TCGA RNA-Seq (FASTQ) files. Case Study #2: PanCancer Analysis of Whole Genomes (PCAWG) Study. High-throughput, harmonized analysis by Seven Bridges of all tumor and matched genomes in the dataset (~1,350). Outcome: rapid generation of ~65,000 output files (including ~5,000 VCFs) totaling 725 TB. Case Study #3: Independent Analysis on 45,000 Genomes. High-throughput analysis of 45,000 bacterial genomes accessed from SRA via API and analyzed using a custom workflow. Outcome: analysis completed in ~1 week by a novice CGC user with no substantive assistance from the CGC team
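
The scale implied by these case studies can be made concrete with a little arithmetic on the quoted figures. The sketch below treats the slide's approximate values as exact, assumes one sample per RNA-Seq file, uses decimal units (1 TB = 1,000 GB), and assumes the one-week analysis ran continuously; the derived numbers are back-of-the-envelope estimates, not results reported in the presentation.

```python
# Figures quoted on the slide (approximate values and upper bounds).
COST_PER_SAMPLE_USD = 0.30   # upper bound: "<$0.30/sample"
RNASEQ_FILES = 9_000         # "~9,000 TCGA RNA-Seq files"
PCAWG_TOTAL_TB = 725         # "totaling 725 TB"
PCAWG_OUTPUT_FILES = 65_000  # "~65,000 output files"
BACTERIAL_GENOMES = 45_000   # "45,000 bacterial genomes"
HOURS_PER_WEEK = 7 * 24      # "~1 week", assumed to run continuously

# Case Study #1: ceiling on total HLA typing cost (one sample per file).
hla_cost_ceiling_usd = round(COST_PER_SAMPLE_USD * RNASEQ_FILES, 2)

# Case Study #2: average output file size in GB (1 TB = 1,000 GB).
pcawg_avg_file_gb = round(PCAWG_TOTAL_TB * 1_000 / PCAWG_OUTPUT_FILES, 2)

# Case Study #3: average throughput in genomes per hour.
genomes_per_hour = round(BACTERIAL_GENOMES / HOURS_PER_WEEK, 1)

print(f"HLA typing cost ceiling: ${hla_cost_ceiling_usd:,.0f}")
print(f"Average PCAWG output file: {pcawg_avg_file_gb} GB")
print(f"Bacterial genomes per hour: {genomes_per_hour}")
```

Under those assumptions the HLA typing run costs less than about $2,700 in total, the PCAWG outputs average roughly 11 GB per file, and the bacterial analysis averages roughly 268 genomes per hour.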

25 The JAX-Seven Bridges PDXNet Data Commons: An NCI-funded Resource for the Patient-Derived Xenograft Development and Trial Centers Research Network

26 The JAX-Seven Bridges PDXNet Data Commons: A cloud-based environment for secure storage, sharing & analysis of data for the Patient-Derived Xenograft Development and Trial Centers Research Network (PDXNet). The JAX-Seven Bridges PDX Data Commons and Coordination Center is funded in whole or in part with Federal funds from the National Cancer Institute, National Institutes of Health, Department of Health and Human Services, under Contract No. 1U24CA. pdxnetwork.org/pdccc/

27 The JAX-Seven Bridges PDXNet Data Commons. Designed to: Connect the PDX Development and Trial Centers (PDTCs) & the Patient-Derived Model Repository (PDMR); Colocalize PDXNet data & bioinformatics resources to facilitate data harmonization, discovery & analysis; Integrate data from individual PDTCs & pilot projects to inform preclinical trials; Make PDXNet data & harmonized workflows FAIR and available to the broader research community

28 Key Features of the PDXNet Data Commons. Collaborative, Usable: Custom data sharing features to enable phased release of consortium datasets to PDXNet participants & to the public. Reproducible: Use of Rabix & CWL for creating reproducible and portable workflows for consortium-wide data harmonization. Extendable, Scalable: Full integration with the Seven Bridges CGC to enable access to all available public datasets & bioinformatics resources; A harmonized metadata model that enables increasingly complex queries across public and private datasets using existing data query tools

29 CAVATICA: The NIH Common Fund Gabriella Miller Kids First Pediatric Data Resource

30 CAVATICA & the Kids First Data Resource: A cloud-based environment for secure storage, sharing & analysis of large volumes of genomic data from pediatric cancer & rare disease patients. cavatica.org

31 CAVATICA & the Kids First Data Resource. Designed to: Integrate data for multiple rare pediatric diseases across dozens of hospitals & clinical sites; Colocalize consortium data & bioinformatics resources to facilitate data harmonization, discovery & analysis; Make Kids First data & harmonized workflows FAIR and available to the broader research community

32 Key Features of CAVATICA & the Kids First Data Resource. Collaborative, Usable: Custom permissions management for fine-grained control of private dataset access. Reproducible: Use of Rabix & CWL for creating reproducible and portable workflows for consortium-wide harmonization. Extendable, Scalable: Interoperability with the CGC to enable authorized access to public datasets; A harmonized metadata model that enables queries across pediatric and adult datasets using existing data query tools

33 FAIR4CURES: An NIH Data Commons Pilot Solution

34 FAIR4CURES: A data and standards ecosystem for making NIH data resources FAIR and for enabling secure data sharing & analysis in collaboration with the NIH Data Commons Pilot Phase Consortium (DCPPC). The FAIR4CURES project is funded in whole or in part with Federal funds from the National Institutes of Health.

35 FAIR4CURES. Designed to: Be a cloud-agnostic platform for making distributed NIH data resources FAIR and available for analysis by the broader research community; Establish community standards and generate resources for making digital objects FAIR. Findable: GUIDs for digital objects; A common metadata model for indexing & search. Accessible: Standardized authentication / authorization. Interoperable: Open API standards; Cross-platform interoperability. Reusable: GUIDs for digital objects
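
The role of GUIDs and a common metadata model can be illustrated with a tiny registry that mints an identifier for each digital object, resolves it to access locations on any cloud, and supports metadata search. This is a sketch under stated assumptions only: the `ObjectRegistry` class, the use of random UUIDs as GUIDs, the metadata keys, and the bucket URLs are hypothetical and are not the identifier scheme adopted by FAIR4CURES or the DCPPC.

```python
import uuid


class ObjectRegistry:
    """Toy registry mapping GUIDs to digital-object records."""

    def __init__(self):
        self._records = {}

    def register(self, name, access_urls, metadata=None):
        # Mint a random UUID as the GUID (an assumption of this sketch).
        guid = str(uuid.uuid4())
        self._records[guid] = {
            "name": name,
            "access_urls": list(access_urls),
            "metadata": dict(metadata or {}),
        }
        return guid

    def resolve(self, guid):
        """Findable/Accessible: look up where an object lives by its GUID."""
        return self._records[guid]

    def search(self, **criteria):
        """Return GUIDs whose metadata matches every given key/value pair."""
        return [
            guid for guid, record in self._records.items()
            if all(record["metadata"].get(k) == v for k, v in criteria.items())
        ]


registry = ObjectRegistry()
# Hypothetical objects; the same object may be reachable on several clouds.
bam_guid = registry.register(
    "sample.bam",
    ["s3://example-bucket/sample.bam", "gs://example-bucket/sample.bam"],
    {"data_type": "alignment", "organism": "human"},
)
wf_guid = registry.register(
    "align.cwl",
    ["https://example.org/workflows/align.cwl"],
    {"data_type": "workflow"},
)

print(registry.resolve(bam_guid)["access_urls"])
print(registry.search(data_type="workflow") == [wf_guid])  # True
```

Because tools and workflows receive identifiers in the same way as data files, an analysis can be described entirely by GUIDs, which is what lets the same identifiers serve findability, reuse, and reproducibility.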

36 Key Features of FAIR4CURES. Collaborative: GUIDs to promote data and tool publication & reuse. Usable: Workspaces connected to multiple cloud providers to enable compute where the data live. Reproducible: GUIDs to promote analytical reproducibility. Extendable, Scalable: A standardized authentication & authorization schema; Open API standards & cross-platform interoperability; A common metadata model that enables queries across increasingly diverse datasets & data types using existing data query tools

37 The Seven Bridges Cloud Ecosystem: Interoperable Data Access and Analysis to Drive Precision Medicine. Infrastructure; Interoperability; Partnerships

38 Liz Williams, PhD 2018 Seven Bridges sevenbridges.com