The IBM Reference Architecture for Healthcare and Life Sciences
Janis Landry-Lane, World Wide Program Director, IBM Systems Group
janisll@us.ibm.com
Doing It Right Symposium, March 23-24, 2017
Big Data Symposium, March 24, 2017
The Era of Genomics Represents BIG DATA
A New Era of Precision Healthcare

Completion of the Human Genome Project in 2003 led to an expansion of research on the contributions of genomics to disease diagnosis, treatment, and prevention.
Green, E.D. et al. (2011). Charting a course for genomic medicine from base pairs to bedside. Nature 470: 204-213.

Key Client Interests
- Early Discovery (university research / pharmaceutical R&D): What biological or environmental factors are causing disease? Can we design diagnostics and drugs to improve patient outcomes?
- Clinical Genomics (hospital systems): What does my patient's genomic information tell me about the treatment I should select?
Key Technical IT Challenges
- Big Data
- Evolving frameworks and databases
- Data silos
- International collaboration
- Complex workloads
Reference Architecture for Healthcare & Life Science Analytics

- Industry Applications
- Applications & Frameworks
- Data Repositories and Databases
- Workload Orchestration: optimize utilization of compute resources across the enterprise
- Enterprise Data Management: improve data access and optimize storage utilization across the enterprise
- Software-Defined Infrastructure
- Compute & Storage Servers: x86, POWER, VM; flash, disk, and tape
- IT Administration; on/off premises hybrid cloud
Reference Architecture for Healthcare & Life Science Analytics: Sample Workloads

Sample workloads and data sources:
- Clinical informatics: EMR, LIMS, CPOE, EDW/UDMH, workflow administration
- Genomic analysis: NGS, reference databases, omics DW
- Image analysis: imaging/PACS, RIS, VNA
- Cognitive analytics: literature, ontologies, knowledge base

Workload orchestration (IBM Spectrum Computing): resource allocation, workload monitoring, cluster provisioning, metadata collection
Enterprise data management (IBM Spectrum Storage): metadata collection, information lifecycle management, high-performance I/O, data sharing over POSIX, HDFS, and object interfaces
Compute & storage servers: x86, POWER, VM; flash, disk, tape; on/off premises hybrid cloud; IT administration
A Hybrid Cloud Architecture

IBM designs a hybrid cloud architecture that supports seamless movement of workloads across on-premises and cloud environments.

- On-premises cluster: Spectrum LSF workloads on Spectrum Scale (GPFS) / IBM Elastic Storage
- Cloud-resident cluster: Spectrum Scale (GPFS) / IBM Elastic Storage
- The two sites are linked over a secure VPN tunnel, with IBM Aspera FASP for bulk transfer and Spectrum Scale AFM keeping the two file systems coherent
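The AFM link described above caches files between the home (on-premises) and cache (cloud) clusters on demand. Here is a minimal toy sketch of that cache-on-read idea in Python; it is an illustration of the concept only, not IBM code, and the function and directory names are hypothetical.

```python
import shutil
from pathlib import Path

def afm_read(name: str, cache: Path, home: Path) -> bytes:
    """Return file contents from the cache cluster, fetching from the
    home cluster only on a cache miss (cache-on-read)."""
    local = cache / name
    if not local.exists():            # cache miss: one-time transfer over the WAN
        shutil.copy2(home / name, local)
    return local.read_bytes()         # subsequent reads are served locally
```

After the first access, the cloud cluster can keep computing against the cached copy even if the WAN link is slow or briefly unavailable, which is the point of placing AFM between the two Spectrum Scale file systems.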
IBM Products Offer Flexibility for Customers

A single platform for workload management with automated resource sharing:
- Workflows/pipelines: Platform Process Manager
- Applications: Platform Symphony applications, Hadoop MapReduce applications, MPI/batch applications, Spark applications
- Schedulers: Platform Symphony scheduler, Spectrum Conductor for Spark, Spectrum LSF scheduler
- IBM Platform Computing: resource orchestration and monitoring

A single file system, IBM Spectrum Scale, with POSIX, NFS, HDFS, and object connectors for efficient data sharing across flash, disk, storage-rich servers, and tape.
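To make the automated resource sharing concrete: a fair-share orchestrator divides a fixed pool of job slots among applications in proportion to configured weights. The sketch below is an assumed, simplified illustration of that idea (largest-remainder rounding), not the actual IBM scheduler logic; the function name and weights are illustrative.

```python
def allocate_slots(total_slots: int, shares: dict) -> dict:
    """Split a cluster's job slots across applications in proportion to
    their fair-share weights, using largest-remainder rounding so every
    slot is assigned."""
    weight_sum = sum(shares.values())
    quotas = {app: total_slots * w / weight_sum for app, w in shares.items()}
    alloc = {app: int(q) for app, q in quotas.items()}
    leftover = total_slots - sum(alloc.values())
    # hand remaining slots to the apps with the largest fractional remainder
    for app in sorted(quotas, key=lambda a: quotas[a] - alloc[a], reverse=True)[:leftover]:
        alloc[app] += 1
    return alloc
```

For example, `allocate_slots(100, {"spark": 2, "mpi": 1, "hadoop": 1})` gives Spark half the cluster and the other two a quarter each; when one application is idle, a real scheduler would lend its share out and reclaim it later.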
Case Study #1: Major Genomics Provider in NY

Mission: deliver analysis to support personalized treatment for individual cancer patients for whom the standard of care has failed.

Requirements: a data architecture that provides management, resiliency, scalability, economics, and long-term retention.

IBM provided:
- Spectrum Scale for performance, data management, and resiliency
- Spectrum Archive to support movement of 1.2 PB per month to tape
- A robust scheduler that supports multi-threaded processing streams to accelerate compute and can efficiently handle unpredictable data streams
- The overall lowest cost of managing tens of PB of online data and an archive with site diversity
The Timeline: Major Genomics Center in NY

[Chart: installed capacity in TB (0-8000) and ingest rate in TB/month over 24 months; the infrastructure scales quickly at low cost, then incrementally for planned growth, with a DR extension added at the end.]

- Month 1, initial assessment (variety of data sources not known, growth unpredictable): 3 Spectrum Archive, 3 Spectrum Scale, TS4500/V7000, 6 TS1150; air-gap security (one tape copy)
- Month 4, revised assessment (volume of data known, growth predictable): add 6 TS1150, add V7000 disk
- Month 11, new project demands (3 PB peak throughput/month; value of data known): add 2 Spectrum Archive, 2 Spectrum Scale, a TS4500 expansion frame, 8 TS1150, and V7000 + SSD for metadata in the cluster
- Month 21, organic growth: no change
- Month 24, DR extension (mission critical, planned): site diversity for system-critical production operation; the second site is at a related institution with high-bandwidth connectivity (Internet2)
Case Study #2: Alberta Children's Hospital Research Inst.

Mission: provide precision medicine for a variety of childhood diseases, including the Care for Rare consortium, and make available a robust platform for new discovery and model organisms.

Requirements: a cost-effective, scalable platform that supports breaking down silos.

IBM provided:
- Spectrum Scale for performance, data management, and resiliency, with an archive
- Compute enhancements to support projects, with the ability to aggregate data into a data model for advanced analytics
- A global namespace that breaks down silos and enables sharing among many research groups
- A robust scheduler that supports both existing and additional compute paradigms
- An overall architecture that can grow incrementally

https://www.ibm.com/news/ca/en/2016/06/15/u599716r24585k82.html
Case Study #3: Mount Sinai School of Medicine

Mission: to achieve the most effective solution for genomic workloads without rearchitecting the industry-standard software, a rigorous analysis of usage statistics, benchmarks, and available technologies was performed to design a system for maximum throughput.

IBM provided:
- Spectrum Scale for performance, data management, and resiliency
- An IBM Flash SCRATCH tier, managed by Spectrum Scale, to optimize workflow performance for small file sizes
- A robust scheduler that could handle 700,000 jobs in the queue
- Data management that moves data from the flash tier to disk as soon as a workflow completes
- A global namespace that supports a large and growing faculty/staff user base
- An overall architecture that can grow incrementally

https://hpc.mssm.edu/files/sc15-bode-paper.pdf
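The flash-to-disk migration above is an information-lifecycle-management policy: when a workflow finishes, its scratch files are drained off the expensive flash tier to free it for the next pipeline. A minimal Python sketch of that policy follows; it is a toy model of the idea (in Spectrum Scale this is done with placement/migration policies, not user scripts), and the function name and age threshold are hypothetical.

```python
import shutil
import time
from pathlib import Path

def drain_scratch(flash: Path, disk: Path, min_age_s: float = 0.0) -> list:
    """Migrate files from the flash SCRATCH tier to the disk tier,
    mimicking an ILM rule: any file untouched for at least min_age_s
    seconds is moved off flash."""
    now = time.time()
    moved = []
    for f in flash.iterdir():
        if f.is_file() and now - f.stat().st_mtime >= min_age_s:
            shutil.move(str(f), str(disk / f.name))
            moved.append(f.name)
    return sorted(moved)
```

Running such a policy immediately after pipeline completion (rather than on a fixed schedule) is what keeps a small flash tier usable for 700,000 queued jobs: the tier only ever holds the files of workflows in flight.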
Lessons Learned

Customer success is derived from:
- Managing explosive growth
- Managing application performance, especially for clinical diagnostics and patients for whom the standard of care has failed
- A robust data management and metadata search system that lets scientists retrieve past data, since they will ultimately rerun their research with new algorithms or need the data for reproducible results
- A global namespace that breaks down silos and provides a single copy of the data
- A robust scheduler that supports both existing and additional compute paradigms, essential for sharing a single IT environment among a myriad of applications

APPLICATIONS come and go, but ARCHITECTURE ENDURES.