UCSC Genomics Institute Current Practices & Challenges of Genomics in the Cloud


1 UCSC Genomics Institute: Current Practices & Challenges of Genomics in the Cloud. Dr. Brian O'Connor, Director, Computational Genomics Platform, UCSC Genomics Institute. November 16th, 2017

2 Genomics Data Production: samples -> sample prep -> data collection on the instrument -> storage and processing -> data analysis and sharing, ending in results such as a called mutation (A -> T).

3 Genomics and Big Data: the cost of sequencing has dropped dramatically over the last 10 years, far exceeding Moore's law and putting enormous pressure on IT systems.

4 The Cloud & Docker as Transformative Technologies: a shift from traditional HPC/on-prem clusters to public and private clouds built on virtual machines, storage services, and virtual networks. Docker containers encapsulate tools, settings, reference files, system libraries, the Linux OS, etc.

5 PCAWG: A Large-Scale, Distributed Analysis Effort. The first large-scale cancer genomics project to fully embrace clouds, organized by the ICGC: 48 projects contributing WGS across 20 primary sites, 700+ scientists organized into 16 working groups, ~5,800 whole genomes, ~2,800 cancer donors, ~1,300 with RNA-Seq data. The goal was to analyze the data consistently.

6 PCAWG Cloud-Based Core Workflows (Tech Working Group)

7 PCAWG: Challenges of Distributed Analysis. An extremely distributed effort: 8 sites storing and sharing data via GNOS (300 TB growing to 900 TB); 14 cloud (and HPC) environments (3 commercial, 7 OpenStack, 4 HPC); ~630 VMs, ~16K cores, and ~60 TB of RAM.

8 PCAWG Timeline: a two-year compute project. It originated in Summer 2014; variant calling began with the Sanger pipeline in Jan 2015, DKFZ/EMBL in May 2015, and Broad in Sept 2015; the project concluded in Summer 2016. Papers are in process, including "Large-Scale Uniform Analysis of Cancer Whole Genomes in Multiple Computing Environments".

9 PCAWG Lessons Learned: (1) robust submission, validation, and cloud-based storage; (2) centralized project tracking, metrics, and monitoring; (3) bring algorithms to the data; (4) design systems to handle failure; (5) seek cloud-friendly policies; (6) leverage commercial clouds. Many of these lessons point to a need for improved standards.

10 Lesson 1: Robust Submission, Validation & Storage. Globally distributed, robust uploads were critical, and validation in the uploader caught many data issues. Although over 16K cores were available to the project across academic clouds and HPC, our runtime data suggested inefficient utilization due to storage bottlenecks; storage limitations were a major bottleneck for analysis. We typically staggered jobs carefully to avoid more than 20 simultaneous downloads (see the throttling sketch below).
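The staggering described above can be approximated with a simple concurrency cap. This is only a sketch: the download command ("gnos-download") and the limit of 20 are assumptions taken from the slide text, not PCAWG's actual tooling.

```python
# Illustrative sketch: cap simultaneous downloads so shared storage is not
# overwhelmed (the slide suggests keeping <= 20 concurrent transfers).
import subprocess
from concurrent.futures import ThreadPoolExecutor

MAX_SIMULTANEOUS_DOWNLOADS = 20  # assumption taken from the slide text

def download(analysis_id: str) -> int:
    """Run one (hypothetical) GNOS download client invocation."""
    result = subprocess.run(
        ["gnos-download", "--analysis-id", analysis_id],  # placeholder CLI
        capture_output=True,
    )
    return result.returncode

def download_all(analysis_ids):
    # The thread pool enforces the stagger: at most
    # MAX_SIMULTANEOUS_DOWNLOADS transfers run at any one time.
    with ThreadPoolExecutor(max_workers=MAX_SIMULTANEOUS_DOWNLOADS) as pool:
        return list(pool.map(download, analysis_ids))
```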

11 Lesson 2: Centralized Project Tracking. With 8 storage sites and 14 cloud and HPC environments, we needed a consistent way to track what was where. We built a central metadata index and used it with "Deciders" and manual assignment of donors/samples to clouds (a minimal sketch follows).
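A "Decider" in this sense is essentially a routine that consults the central metadata index and picks a compute environment for each donor. The sketch below is a minimal illustration only; the index layout, donor IDs, and site/cloud names are invented for the example.

```python
# Minimal "Decider" sketch: assign each donor to a compute cloud by
# consulting a central metadata index. All names here are invented.
from collections import defaultdict

# Hypothetical central index: donor ID -> site where its data is stored.
metadata_index = {"DO1001": "site-a", "DO1002": "site-b", "DO1003": "site-a"}

# Hypothetical mapping of storage sites to clouds that can reach them.
clouds_near_site = {
    "site-a": ["academic-cloud-1", "aws"],
    "site-b": ["academic-cloud-2"],
}

def decide(donor_id: str, assigned: dict) -> str:
    """Pick the least-loaded cloud that is close to the donor's data."""
    candidates = clouds_near_site[metadata_index[donor_id]]
    return min(candidates, key=lambda cloud: assigned[cloud])

assignments = defaultdict(int)
plan = {}
for donor in metadata_index:
    cloud = decide(donor, assignments)
    assignments[cloud] += 1
    plan[donor] = cloud
print(plan)
```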

12 Lesson 3: Bring Algorithms to the Data. The datasets were too large to move, so we used Docker to move algorithms to the data instead, containerizing workflows for portability between sites.

13 Lesson 4: Design Systems to Handle Failure. (Architecture diagram: sequencing projects feed GNOS and the metadata index; cloud orchestrators drive compute on academic compute clouds and on the commercial cloud spot market.)

14 Lesson 5: Seek Cloud-Friendly Policies. PCAWG analysis showed the power of clouds, and key policy changes enabled commercial cloud usage: NIH updated its dbGaP cloud policy in March 2015, and ICGC DACO updated the ICGC cloud policy in May 2015. Partnerships with commercial cloud providers followed: the Amazon Public Datasets Program (storage donated, compute at market rates), Seven Bridges on AWS (transient storage, compute, and consulting donated), and Azure via the BD2K Center for Big Data in Translational Genomics at UCSC (compute donated).

15 PCAWG Analysis Architecture & AWS. (Architecture diagram: sequencing projects, GNOS, and the metadata index feed cloud orchestrators running on academic compute clouds and on the Amazon cloud, with S3, spot-market compute, DNAnexus, and Seven Bridges.) This represented a major shift: ICGC data was redistributed within Amazon's cloud.

16 Lesson 6: Leverage Commercial Clouds

Workflow     Hardware (cores / machine)   Runtime                     Cost on AWS
BWA          8 cores (16 GB RAM)          5 days (± 5) per specimen   $11.16
Sanger       8 cores (32 GB RAM)          4 days (± 3) per donor      $17.22
DKFZ/EMBL    16 cores (64 GB RAM)         2 days (± 0.6) per donor    $12.80
Broad        32 cores (256 GB RAM)        2.6 days per donor          $20.48
Total                                                                 ~$62/donor

Storage required per donor: BWA 240 GB, Sanger 4 GB, DKFZ/EMBL 5 GB; total 249 GB.

Data analysis: Create a cloud commons, Nature, 2015.
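As a quick sanity check, the per-donor totals follow directly from the figures in the table above:

```python
# Per-donor AWS costs and storage taken from the table above.
costs_usd = {"BWA": 11.16, "Sanger": 17.22, "DKFZ/EMBL": 12.80, "Broad": 20.48}
storage_gb = {"BWA": 240, "Sanger": 4, "DKFZ/EMBL": 5}

print(f"Total compute cost per donor: ${sum(costs_usd.values()):.2f}")  # $61.66, i.e. ~$62
print(f"Total storage per donor: {sum(storage_gb.values())} GB")        # 249 GB
```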

17 Lesson 6: Leverage Commercial Clouds. Commercial clouds were available, stable, and predictable in cost, allowing an early start while other environments came online.

18 Accessing PCAWG Results. Simple somatic calls are accessible, as are full variant/BAM files. Variant files include structural germline/somatic variants, simple germline/somatic mutations, somatic copy-number mutations, and aligned reads. Publication: "Large-Scale Uniform Analysis of Cancer Whole Genomes in Multiple Computing Environments".

19 PCAWG Software Legacy: Redwood (storage), Dockstore (tools/workflows), Consonance (cloud orchestration), and Toil (workflow execution). These are reusable software tools that provide key components of a generic PCAWG-based platform.

20 Redwood - Cloud Storage System. The Redwood storage system (and the underlying ICGC cloud object store) provided a fast, secure mechanism for storing and using genomic data; an example run of ~100 simultaneous downloads from the object store saw ~45-100 MB/s.

21 Dockstore.org - Tool/Workflow Sharing. Dockstore shares tools and workflows: package tools with Docker, describe them with CWL/WDL, and couple them with test data to make them highly portable.

22 Consonance - Multi-Cloud/Region Work Queue. Consonance is designed to provide work queues across a variety of cloud environments.

23 Toil - Efficient Workflow Execution. A system for large-scale workflow execution on AWS and other clouds, more efficient than the PCAWG setup; per-job granularity allows for better efficiency and robustness (see the sketch below).
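To illustrate what per-job granularity looks like in Toil, here is a minimal workflow closely following Toil's documented quickstart; exact option names and resource strings may vary between Toil versions.

```python
# Minimal Toil workflow: each Job declares its own resource needs, which is
# what gives Toil per-job granularity for scheduling and retries.
# Based on Toil's documented quickstart; details may differ by Toil version.
from toil.common import Toil
from toil.job import Job

class HelloWorld(Job):
    def __init__(self, message):
        # Per-job resource requests (memory, cores, disk).
        Job.__init__(self, memory="2G", cores=1, disk="3G")
        self.message = message

    def run(self, fileStore):
        return "Hello, world! Message: %s" % self.message

if __name__ == "__main__":
    options = Job.Runner.getDefaultOptions("./toil-jobstore")
    options.logLevel = "INFO"
    options.clean = "always"

    with Toil(options) as toil:
        print(toil.start(HelloWorld("PCAWG")))
```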

24 Toil RNA-seq Recompute. Toil recently completed a 20K-sample, 30K-core compute.


26 Building the UCSC Computational Genomics Platform with PCAWG Tools

27 PCAWG Standards Legacy. Software is important, but creating standards is even more so; several opportunities for standards emerged from PCAWG.

28 GA4GH Tool Sharing API. The Tool Registry API formalizes the standard through the GA4GH Cloud Work Stream and is implemented in Dockstore: a basic read API with extended support for write and search. By convention, a tool couples a Docker image with a CWL/WDL descriptor, and the API standard for sharing covers GET (list), GET (search), and POST (register).
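To make the list/search/register flow concrete, here is a minimal client sketch. The base URL, query parameter, registration payload, and auth header are assumptions for illustration, not the finalized GA4GH specification; consult the tool registry spec or Dockstore documentation for the real interface.

```python
# Illustrative Tool Registry API client sketch (endpoints are hypothetical).
import requests

TRS_BASE = "https://registry.example.org/api/ga4gh/v2"  # hypothetical base URL

def list_tools():
    # GET list: enumerate registered tools (Docker image + CWL/WDL descriptor).
    return requests.get(f"{TRS_BASE}/tools").json()

def search_tools(text):
    # GET search: filter the registry; the query parameter name is an assumption.
    return requests.get(f"{TRS_BASE}/tools", params={"toolname": text}).json()

def register_tool(payload, token):
    # POST register: the slide's "extended support for write"; payload shape
    # and bearer-token auth are placeholders.
    return requests.post(
        f"{TRS_BASE}/tools",
        json=payload,
        headers={"Authorization": f"Bearer {token}"},
    ).json()

if __name__ == "__main__":
    for tool in list_tools()[:5]:
        print(tool.get("id"), tool.get("toolname"))
```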

29 GA4GH Analysis Execution API. The Workflow and Task Execution Service APIs are further work of the Cloud Work Stream: a Docker image plus a WDL/CWL workflow or tool is submitted with POST (new task), monitored with GET (task status), and its outputs retrieved with GET (task stderr/stdout + JSON), with a cloud-specific implementation behind the standard API.
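A minimal client sketch of that task-execution flow follows (POST a task, poll its status, fetch stdout/stderr). The endpoint, task document shape, and state names are illustrative assumptions, not a definitive rendering of the GA4GH APIs.

```python
# Illustrative Task Execution client sketch (endpoint and schema assumed).
import time
import requests

TES_BASE = "https://tasks.example.org/v1"  # hypothetical endpoint

task = {
    "name": "variant-calling-demo",
    "executors": [
        {
            "image": "quay.io/example/caller:1.0",  # placeholder Docker image
            "command": ["caller", "--input", "/data/sample.bam"],
        }
    ],
}

# POST new task
task_id = requests.post(f"{TES_BASE}/tasks", json=task).json()["id"]

# GET task status until the task reaches a terminal state
while True:
    info = requests.get(f"{TES_BASE}/tasks/{task_id}", params={"view": "FULL"}).json()
    if info["state"] in ("COMPLETE", "EXECUTOR_ERROR", "SYSTEM_ERROR", "CANCELED"):
        break
    time.sleep(30)

# GET task stderr/stdout (here, assumed to be embedded in the FULL view's logs)
for task_log in info.get("logs", []):
    for exec_log in task_log.get("logs", []):
        print(exec_log.get("stdout"), exec_log.get("stderr"))
```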

30 GA4GH Data Object Sharing API. Emerging standards incubating in the Cloud Work Stream: multiple object stores (including Redwood) exposed behind a common API, with a central Elasticsearch index and the Boardwalk browser on top.
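A minimal sketch of what resolving a shared data object might look like under such an API; the base URL, paths, and field names are assumptions for illustration only, not the finalized GA4GH schema.

```python
# Illustrative data-object lookup sketch: resolve an object ID to the URLs
# where copies live across object stores (endpoint and fields assumed).
import requests

DOS_BASE = "https://dataobjects.example.org/ga4gh/dos/v1"  # hypothetical

def resolve(object_id: str):
    """Return the (cloud-specific) URLs and checksums for one data object."""
    obj = requests.get(f"{DOS_BASE}/dataobjects/{object_id}").json()["data_object"]
    return {
        "name": obj.get("name"),
        "size": obj.get("size"),
        "checksums": obj.get("checksums"),
        "urls": [u["url"] for u in obj.get("urls", [])],
    }

if __name__ == "__main__":
    print(resolve("example-object-id"))
```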

31 GA4GH Ecosystem for Future Projects: Dockstore, Consonance/Toil, and Redwood, tied together through work in the GA4GH and the NIH Data Commons.

32 Future of Genomics. PCAWG was a great example of a modern, cloud-based genomics analysis project. Projects like TOPMed, HCA, and All of Us will sequence hundreds of thousands of genomes, producing 50+ petabytes of data in the next 5 years.

33 Future of Genomics. Most such projects, including HCA and TOPMed, are cloud-based, making hundreds of thousands of genomes available in highly scalable environments. In the not-too-distant future, sequencing may be routine in healthcare, with millions of genomes routinely sequenced and analyzed. Infrastructure will continue to be a key challenge!

34 Acknowledgements - PCAWG Researchers (700+). GA4GH Cloud Work Stream: Broad Institute, Cincinnati Children's Hospital, Curoverse, European Bioinformatics Institute, Intel, Institute for Systems Biology, Google, Microsoft, Amazon, Ontario Institute for Cancer Research, Oregon Health and Science University, Seven Bridges Genomics, University of California Santa Cruz. Lincoln Stein, Josh Stuart, Gad Getz, Peter Campbell, Jan Korbel - PCAWG Tech Group; Vincent Ferretti - Storage; Denis Yuen - Dockstore; Kyle Ellrott - Task API; Peter Amstutz - Workflow API; Jeff Gentry - GA4GH; David Glazer - GA4GH Co-leader; Hannes Schmidt, Benedict Paten & the Toil Team; Walt Shands, Carlos Espinosa, & the UCSC CGP.

35 Extra Slides

36 PCAWG Lessons Learned: Cloud Costs. (Earlier version of the slide 16 table, showing the same runtimes and per-donor storage without AWS costs; Ontario Institute for Cancer Research.) Data analysis: Create a cloud commons, Nature, 2015.

37 Lesson 6: Cloud Costs. (Duplicate of the runtime, cost, and storage table on slide 16.)

38 GA4GH Data Exchange & Cloud Standards. (Diagram: genomic sequence data flowing between Dockstore, Consonance/Toil, and Redwood; credit: Kevin Osborn.)

39 Human Cell Atlas: "To create comprehensive reference maps of all human cells, the fundamental units of life, as a basis for both understanding human health and diagnosing, monitoring, and treating disease." Infrastructure is a possible bridge project between the NIH Cloud Commons efforts and the HCA, made interoperable through GA4GH API standards.

40 PCAWG: Running Distributed Core Variant Calling. The Sanger pipeline ran across 13 HPC and cloud environments, DKFZ/EMBL across 7, and Broad across 3.

41 PCAWG Data Availability: PCAWG data on the AWS cloud covers 1,432 PCAWG donors, with BAMs (~70% of ICGC donors), VCFs from all three pipelines, and more ICGC data uploaded regularly.