Science as a Service
Accelerating Scientific Discovery using Cloud
Ravi Madduri, madduri@anl.gov
Internet2 Global Summit 2016, Chicago
Outline
  Scientific Discovery Process
  The Case for Cloud
  Science as a Service
  Globus
  Research Data Management and Analysis
  Perspectives from NIH
  Success stories
Our vision for a 21st century discovery infrastructure
  Provide more capability for more people at lower cost by delivering Science as a Service
  www.globus.org
Scientific Discovery Process
  [Cycle diagram] Pose question, design experiment, collect data, analyze data, identify patterns, hypothesize explanation, test hypothesis, publish results
Eliminating data friction is essential to modern science
  "Civilization advances by extending the number of important operations which we can perform without thinking about them" (Whitehead, 1912)
  Obstacles to data access, movement, discovery, sharing, and analysis slow research, distort research directions, and waste time (DOE reports)
Imagine if a researcher, when tackling a problem, could easily:
  Assemble, integrate, and interpret all relevant data within a knowledge network
  Be informed of anomalies, patterns, gaps
  Formulate & apply computational models
  Outsource tasks if local expertise is lacking
  Launch automated processes to test hypotheses and expand the knowledge network
  Pay for all this by taking on other tasks
We will cover
  Accelerating the Scientific Discovery Process by providing Science as a Service
  Research Data Management
  Analyzing Research Data: interactive analysis, large-scale analysis
  Publishing Results so others can discover, validate, and reproduce/use them
Cloud has transformed platforms and how software is delivered
  Software as a service: SaaS (web & mobile apps)
  Platform as a service: PaaS
  Infrastructure as a service: IaaS
  PaaS enables faster, cheaper, and more scalable delivery of powerful apps as SaaS
Our Science Stack (layers: SaaS, PaaS, IaaS)
  Globus Galaxies
  Galaxy: creation, execution, sharing, and discovery of workflows
  Interactive execution: IPython, R
  Globus: data management, identity management
  AWS: HTCondor, Chef, EC2, EBS, S3, SNS, Spot, Route 53, CloudFormation
Managing big data with Globus
  [Diagram locations: Light Source, Compute Facility, Personal Computer, Publication Repository]
  1. PI initiates transfer request; or requested automatically by script, science gateway
  2. Globus transfers files reliably, securely
  3. PI selects files to share, selects user or group, and sets access permissions
  4. Globus controls access to shared files on existing storage; no need to move files to cloud storage!
  5. Researcher logs in to Globus and accesses shared files; no local account required; download via Globus
  6. Researcher assembles data set; describes it using metadata (Dublin Core and domain-specific)
  7. Curator reviews and approves; data set published on campus or other system
  8. Peers, collaborators search and discover datasets; transfer and share using Globus
  SaaS → Only a web browser required
  Access using your campus credentials
  Globus monitors and informs throughout
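The sharing steps on this slide (the PI picks files, picks a user or group, sets permissions, and access is then checked against files that stay on existing storage) can be illustrated with a small in-memory model. This is only a sketch of the idea and is not the Globus API: the `AccessRule` and `SharedFolder` classes, the user and group names, and the paths are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class AccessRule:
    """One sharing rule: who may do what under which folder (hypothetical model)."""
    principal: str      # a user name or a group name
    path_prefix: str    # folder on the existing storage, e.g. "/run42/"
    permissions: str    # "r" for read-only, "rw" for read and write


@dataclass
class SharedFolder:
    """Hypothetical stand-in for files shared in place on existing storage."""
    rules: list = field(default_factory=list)
    groups: dict = field(default_factory=dict)  # group name -> set of user names

    def share(self, principal, path_prefix, permissions="r"):
        """PI step: grant a user or group access to everything under a folder."""
        if permissions not in ("r", "rw"):
            raise ValueError("permissions must be 'r' or 'rw'")
        self.rules.append(AccessRule(principal, path_prefix, permissions))

    def can_access(self, user, path, write=False):
        """Access check: True if any rule covers this user, path, and mode."""
        for rule in self.rules:
            is_member = user in self.groups.get(rule.principal, set())
            if rule.principal != user and not is_member:
                continue  # rule is for someone else
            if not path.startswith(rule.path_prefix):
                continue  # rule covers a different folder
            if write and "w" not in rule.permissions:
                continue  # rule is read-only
            return True
        return False


folder = SharedFolder(groups={"beamline-team": {"alice", "bob"}})
folder.share("beamline-team", "/run42/", "r")       # group gets read-only access
folder.share("carol", "/run42/processed/", "rw")    # one user gets read/write

print(folder.can_access("alice", "/run42/raw/img001.h5"))        # group member, read
print(folder.can_access("alice", "/run42/raw/img001.h5", True))  # group member, write
```

In this model nothing is copied when a folder is shared; only a rule is recorded, which mirrors the slide's point that files need not move to cloud storage.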
Globus Platform-as-a-Service
  Globus APIs
  Sharing Service
  Transfer Service
  Globus Connect
  Identity, Group, Profile Management Services
  Globus Toolkit
Globus Adoption and Usage
  166,449 active Globus endpoints
  27,961 users registered
  Biggest transfer: 500.42 TB
  Longest running transfer: 182 days
  Fastest transfer: 58.5 Gbps (average)
  55 TB moved per day, on average, since the service was launched in November 2010
  Average throughput: 637.7 Mbps (since service launch)
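To put these rates in more familiar terms, the short calculation below converts the quoted throughput figures into data volume and time. It assumes decimal units (1 TB = 10^12 bytes, 1 Mbps = 10^6 bits per second), which the slide does not state, and it combines the biggest and the fastest transfer purely as an illustration; the slide does not say they were the same transfer.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def terabytes_per_day(rate_mbps):
    """Data volume moved in one day by a stream sustained at rate_mbps."""
    bytes_per_second = rate_mbps * 1e6 / 8
    return bytes_per_second * SECONDS_PER_DAY / 1e12


def transfer_hours(size_tb, rate_gbps):
    """Hours needed to move size_tb terabytes at a sustained rate_gbps."""
    bits = size_tb * 1e12 * 8
    return bits / (rate_gbps * 1e9) / 3600


# One stream at the quoted average throughput of 637.7 Mbps.
avg_tb_per_day = terabytes_per_day(637.7)

# Illustration only: the 500.42 TB transfer size at the 58.5 Gbps peak rate.
hours_at_peak = transfer_hours(500.42, 58.5)

print(f"{avg_tb_per_day:.2f} TB/day at 637.7 Mbps")            # about 6.89
print(f"{hours_at_peak:.1f} hours for 500.42 TB at 58.5 Gbps")  # about 19.0
```

Under these assumptions, the 55 TB moved per day corresponds to roughly eight streams running continuously at the average throughput.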
Analyzing Big Data using Globus Galaxies
  Data management
    Globus provides high-performance, fault-tolerant, secure file transfer between all data endpoints
    [Diagram endpoints: Sequencing Centers, Public Data, Research Lab, Storage, Local Cluster/Cloud]
  Data analysis
    Globus Galaxies on Amazon EC2
    Galaxy-based workflow management, with Globus integrated within Galaxy
    Galaxy Data Libraries
    [Diagram workflow boxes: Fastq, Ref Genome, Picard, Alignment, GATK, Variant Calling]
    Web-based UI with drag-and-drop workflow creation
    Easily modify workflows with new tools
    Analytical tools are automatically run on the scalable compute resources when possible
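A workflow like the one pictured on this slide can be thought of as a dependency graph whose steps run once their inputs are ready. The sketch below uses Python's standard `graphlib` module (Python 3.9+) to derive a valid run order. The step names come from the slide's diagram labels, but the dependencies between them are an assumption, since the flattened slide does not preserve the arrows; this is not how Galaxy represents workflows internally.

```python
from graphlib import TopologicalSorter

# Each step maps to the set of steps it depends on.
# ASSUMPTION: these edges are a plausible reading of the diagram, not taken from it.
workflow = {
    "alignment": {"fastq", "ref_genome"},
    "picard": {"alignment"},
    "gatk_variant_calling": {"picard", "ref_genome"},
}


def run_order(steps):
    """Return one order in which every step comes after its dependencies."""
    return list(TopologicalSorter(steps).static_order())


order = run_order(workflow)
for position, step in enumerate(order, start=1):
    print(position, step)
```

Adding a new tool to such a workflow amounts to adding one entry with its dependencies, which is the kind of modification the drag-and-drop editor exposes graphically.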
Examples of Science as a Service
  Globus Genomics: large-scale NGS analysis
  PDACS: Portal for Data Analysis Services for Cosmological Simulations
  CVRG Galaxy: large-scale ECG data analysis
  Globus Proteomics
  ematter: materials science simulations
  FACE-IT: Framework to Advance Climate, Economic, and Impact Investigations with Information Technology (usefaceit.org)
Examples of what researchers have done using Globus Genomics
Examples in Genomics
  "A profile of inherited predisposition to breast cancer among Nigerian women." Y. Zheng, T. Walsh, F. Yoshimatsu, M. Lee, S. Gulsuner, S. Casadei, A. Rodriguez, T. Ogundiran, C. Babalola, O. Ojengbede, D. Sighoko, R. Madduri, M.-C. King, O. Olopade
  "A case study for high throughput analysis of NGS data for translational research using Globus Genomics." D. Sulakhe, A. Rodriguez, K. Bhuvaneshwar, Y. Gusev, R. Madduri, L. Lacinski, U. Dave, I. Foster, S. Madhavan
Globus Genomics at a glance
  30 institutions, groups, labs
  2 PB of raw sequences analyzed
  1000s of genomes processed
  5 days: longest running workflow
  10s of millions of core hours
  >1000 analysis tools
  >50 workflows
  99% uptime over the past two years
  >20 publications
  1 PB: largest single transfer to date
  <1 day turnaround time
  100s of different species
Diversity of Collaborations
  Cox Lab
  Volchenboum Lab
  Olopade Lab
Costs are remarkably low
  Pricing includes:
  Estimated compute
  Storage (one month)
  Globus Genomics platform usage
  Support
Some of the NIH cloud activities that we are involved in
NIH Commons Pilots
  minids: unique identifiers and minimal metadata for digital objects
  BagIt: containers for data objects
  Registries/Indexes
  APIs
  Common Workflow Language for reproducible workflows
  BDDS Data Publication Services
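As a rough illustration of what a BagIt container looks like on disk, the sketch below writes a minimal bag (a `bagit.txt` declaration, a `data/` payload directory, and a `manifest-sha256.txt` of checksums) and then checks the payload against the manifest. The layout follows the published BagIt specification as commonly understood; the talk does not say which BagIt version or tooling the pilots use, and the helper names `make_bag` and `verify_bag` and the sample file are hypothetical. minids are not modeled here.

```python
import hashlib
import tempfile
from pathlib import Path


def make_bag(bag_dir, payload):
    """Write a minimal bag: bagit.txt, data/<files>, manifest-sha256.txt.

    payload maps flat file names to their content as bytes.
    """
    bag_dir = Path(bag_dir)
    data_dir = bag_dir / "data"
    data_dir.mkdir(parents=True)
    manifest_lines = []
    for name, content in sorted(payload.items()):
        (data_dir / name).write_bytes(content)
        digest = hashlib.sha256(content).hexdigest()
        manifest_lines.append(f"{digest}  data/{name}")
    (bag_dir / "bagit.txt").write_text(
        "BagIt-Version: 1.0\nTag-File-Character-Encoding: UTF-8\n",
        encoding="utf-8",
    )
    (bag_dir / "manifest-sha256.txt").write_text(
        "\n".join(manifest_lines) + "\n", encoding="utf-8"
    )
    return bag_dir


def verify_bag(bag_dir):
    """Return True if every payload file matches its manifest checksum."""
    bag_dir = Path(bag_dir)
    manifest = (bag_dir / "manifest-sha256.txt").read_text(encoding="utf-8")
    for line in manifest.splitlines():
        digest, rel_path = line.split(maxsplit=1)
        actual = hashlib.sha256((bag_dir / rel_path).read_bytes()).hexdigest()
        if actual != digest:
            return False
    return True


with tempfile.TemporaryDirectory() as tmp:
    bag = make_bag(Path(tmp) / "example_bag",
                   {"reads.fastq": b"@r1\nACGT\n+\nIIII\n"})
    bag_ok = verify_bag(bag)            # payload untouched
    (bag / "data" / "reads.fastq").write_bytes(b"tampered")
    tampered_ok = verify_bag(bag)       # payload changed after bagging

print(bag_ok, tampered_ok)
```

The checksum manifest is what lets a recipient confirm that a transferred data set arrived intact, which is one reason a container format pairs naturally with identifiers and reproducible workflows.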
Our work is supported by: U.S. DEPARTMENT OF ENERGY
Thank you! @madduri