Why would you NOT use public clouds for your big data & compute workloads?

1 BETTER ANSWERS, FASTER. Why would you NOT use public clouds for your big data & compute workloads? Rob Futrick

2 Our Belief: Increased access to compute creates new science, new solutions, and a better world. The Cloud is the way to do that. We make computation in the cloud productive at any scale.

3 Why NOT the Cloud for Big Compute? Just a few reasons we hear:
- I already own infrastructure.
- Cost
- Security
- Data inertia
- MPI/BigData/GPU/etc. workflows
- Cloud is too hard

4 Different Perspectives: Scientist / Engineer, SysArchitect, Organizational

5 Why is Cloud Productivity Hard? Individual user perspective:
- Easy to get a cloud server; hard to create a cloud workflow
- Data in the right place at the right time
- Learning to exploit unlimited compute
- Understanding & managing costs & usage

6 Why is Cloud Productivity Hard? SysAdmin perspective:
- Multiple environments instead of just one
- Managing data placement and access
- On-boarding software suites / workflows
- Managing, simplifying, and growing user access
- Exploiting disposable infrastructure across multiple clouds

7 Why is Cloud Productivity Hard? Organizational perspective:
- Security
- Expenses & budgeting
- Managing and optimizing across cloud providers
- Enabling an effective hybrid strategy

8 HPC or Big Data: Same Issues. Classic HPC: small data creating big data. Data-driven sciences: big data creating small data.

9 So who is using the Cloud?

10 Cloud Workloads Running Today:
- Financial: batch, risk modeling, pricing
- Big data: NoSQL, data lakes & analytics
- Life sciences: batch, genomics
- Manufacturing, O&G, electronics: simulation

11 Broad Institute's Cancer Program

12 Computing Map for Future Cancer Research Efforts: a machine learning workload to infer relationships among and between cancer cell line and gene/expression data sets that include:
- Hundreds of cancer cell lines,
- Information on the genetic mutations present in each cell line,
- Gene expression data showing which genes are more or less active under various conditions,
- Information about how various small molecules interact with the cell lines at both large and small scales.
Each of these data sets is massively complex in its own right.

13 The Problem: To build this map for only several hundred samples on a single CPU would have required decades of computing.
- Internal processes couldn't absorb the SW and compute requirements
- Researchers found themselves holding back from running certain calculations
- Would have required coordination across too many groups

14 Cloud-only Workflow: data moves from The Broad's internal storage over a RESTful API into GCP Cloud Storage; CycleCloud provisions an NFS filer and a pool of VMs scheduled by Univa Grid Engine (UGE).
GCP Preemptible VMs:
- Cost up to 70% less than regular instances
- Last up to 24 hours
- When preempted, you get 30 seconds to wrap up your work (a shutdown-hook sketch follows below)
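
The 30-second preemption warning implies each worker needs a shutdown hook. As a minimal sketch (not part of the deck), a GCE instance can long-poll its metadata server for the preemption flag; the checkpoint_and_requeue() helper here is hypothetical:

```python
import urllib.request

METADATA_URL = ("http://metadata.google.internal/computeMetadata/v1/"
                "instance/preempted?wait_for_change=true")

def wait_for_preemption():
    """Block until GCE marks this instance as preempted, then return."""
    req = urllib.request.Request(METADATA_URL, headers={"Metadata-Flavor": "Google"})
    while True:
        with urllib.request.urlopen(req) as resp:
            if resp.read().decode().strip() == "TRUE":
                return

def checkpoint_and_requeue():
    # Hypothetical: flush partial results to the NFS filer and let the
    # scheduler (e.g. UGE) rerun the task on another instance.
    pass

if __name__ == "__main__":
    wait_for_preemption()      # returns when the preemption notice fires
    checkpoint_and_requeue()   # must finish well inside the 30-second window
```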

15 GCP Cluster Requirements
Category | Description | Requirement
Number of Cores | How many cores are required on a single node for the application? | 1 per job
Amount of Memory (RAM) | How much memory on a node (or per core) is required for the application? | ~5 GB per job
Operating System (O/S) | What operating system does the application need? | Ubuntu 12
Libraries/Tools/Software | What additional libraries, tools, and software need to be installed? Compilers? Commercial software? | Analysis written in R using "party" and other R packages
Parallelization | Can the application run in a parallel manner? If so, how (threaded, MPI, or multiple instances of the application)? If it runs in parallel across many nodes, how many nodes are required? | Simple scatter-gather job relationship; the application is pleasantly parallel (see the array-job sketch below)
Cluster Storage | How much storage space will be required for each run (input, intermediate, and output files)? | Total input 200 MB/job (pulled from The Broad via a RESTful API during execution)
Shared Storage | Does this storage have to be shared across all nodes? | NFS storage required for intermediate files; cloud storage used for initial staging
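
The scatter-gather pattern in the table maps naturally onto a Grid Engine array job. A minimal sketch, assuming qsub is on the PATH and a hypothetical run_sample.sh wrapper that reads $SGE_TASK_ID:

```python
import subprocess

N_TASKS = 340891          # one task per job, per the metrics slide
MEM_PER_TASK = "5G"       # ~5 GB per job, per the requirements table

def submit_array_job():
    """Submit a scatter-gather array job to Univa/Sun Grid Engine."""
    cmd = [
        "qsub",
        "-t", f"1-{N_TASKS}",           # array job: tasks 1..N, one per sample
        "-l", f"h_vmem={MEM_PER_TASK}",  # per-task memory request
        "-cwd",                          # run from the submission directory
        "run_sample.sh",                 # hypothetical wrapper; reads $SGE_TASK_ID
    ]
    subprocess.run(cmd, check=True)

if __name__ == "__main__":
    submit_array_job()
```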

16 The Solution:
- Use GCP's Preemptible VMs: n1-highmem and n1-standard instances across zones in a single region
- Univa Grid Engine (UGE) as the batch scheduler
- Enable self-service job submission
- Orchestrate the machine learning app across the full cluster (51,200 cores)
- Manage encryption, data routing, and results: cancer data is encrypted, routed to the Google cluster (encrypted disk/data), and results are returned (an illustrative encryption sketch follows below)
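
The deck does not say how the encryption step was implemented; purely as an illustration of "encrypt, then route data to cloud," here is a sketch using Python's cryptography package (the file names and key handling are assumptions):

```python
from cryptography.fernet import Fernet  # assumes: pip install cryptography

def encrypt_file(src_path: str, dst_path: str, key: bytes) -> None:
    """Encrypt src_path with a symmetric key before it leaves internal storage."""
    with open(src_path, "rb") as f:
        ciphertext = Fernet(key).encrypt(f.read())
    with open(dst_path, "wb") as f:
        f.write(ciphertext)

if __name__ == "__main__":
    key = Fernet.generate_key()   # in practice, managed by a key service on-premises
    encrypt_file("sample_0001.tsv", "sample_0001.tsv.enc", key)
    # The .enc file is what gets routed to cloud storage; the key never leaves the site.
```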

17 51,200 Cores in GCP

18 Running

19 Results:
- 3 decades of computing in 6 hours for < $5,000
- 340,891 jobs across a 51,200-core cluster
- ~68 TB of input data processed (340,891 jobs x 200 MB/job), plus roughly 6x that in intermediate data
- 2 weeks from concept to project completion!
- Blog link:

20 Metrics
Metric | Value
Total instance hours | 15,273
Total core hours | 243,552
Total job count | 340,891
Total job wallclock (hours) | 104,099
Total job CPU time (hours) | 100,841
Total cores at peak | 51,184
Total instances at peak | 3,199
Total RAM at peak | 293 TB
Total estimated price (1.75¢/core-hour) | < $5,000
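
A quick sanity check (not in the deck) that the headline price is consistent with the core-hour count, assuming the 1.75¢ figure is per core-hour:

```python
core_hours = 243_552
price_per_core_hour = 0.0175          # $0.0175, i.e. 1.75 cents per core-hour

estimated_cost = core_hours * price_per_core_hour
print(f"Estimated compute cost: ${estimated_cost:,.2f}")   # ~$4,262, under the $5,000 headline
```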

21

22 Estimate Biomass in South Sahara (Intel Head in Clouds Challenge Award). Credit: much of the content of these slides was provided by Dr. Daniel Duffy, NASA.

23 Estimate Biomass in South Sahara
Challenge: Use National Geospatial-Intelligence Agency (NGA) data to estimate tree and bush biomass over the entire arid and semi-arid zone on the south side of the Sahara.
Project summary: Estimate carbon stored in trees and bushes in the arid and semi-arid south Sahara; establish a carbon baseline for later research on expected CO2 uptake on the south side of the Sahara.
Principal investigators: Dr. Compton J. Tucker, NASA Goddard Space Flight Center; Dr. Paul Morin, University of Minnesota.
Reference: Tucker and Morin are extending earlier tree and bush mapping work by Gonzalez, Tucker, and Sy, "Tree density and species decline in the African Sahel attributable to climate," Journal of Arid Environments (2012).

24 Desired Full Zone of Study

25 How to Break Down the Data
- Polar circumference of the Earth = 40,008 km; 40,008 km / 360 latitude degrees ≈ 111.1 km per latitude degree
- Equatorial circumference of the Earth = 40,075 km; 40,075 km / 360 longitude degrees ≈ 111.3 km per longitude degree
- Single UTM zone (5.91 longitude degrees by 12.0 latitude degrees): 5.91 lon degrees x 111.3 km/longitude degree ≈ 658 km; 12 lat degrees x 111.1 km/latitude degree ≈ 1,334 km
- Decomposition hierarchy: Zones → Tiles → Sub-Tiles → Chunks
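
The per-degree kilometre figures above were garbled in the transcription, so here is the same arithmetic reproduced directly:

```python
# Recompute the slide's per-degree spacing and UTM-zone dimensions.
POLAR_CIRCUMFERENCE_KM = 40_008
EQUATORIAL_CIRCUMFERENCE_KM = 40_075

km_per_lat_degree = POLAR_CIRCUMFERENCE_KM / 360          # ~111.13 km
km_per_lon_degree = EQUATORIAL_CIRCUMFERENCE_KM / 360      # ~111.32 km (at the equator)

zone_width_km = 5.91 * km_per_lon_degree                   # ~658 km
zone_height_km = 12.0 * km_per_lat_degree                  # ~1,334 km

print(f"{km_per_lat_degree:.2f} km/lat degree, {km_per_lon_degree:.2f} km/lon degree")
print(f"UTM zone ~= {zone_width_km:.0f} km x {zone_height_km:.0f} km")
```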

26 Workflow: Hybrid Cloud. NGA data external to NASA (PGC, Digital Globe, hard drives) and NGA data already at NASA flow through NCCS/NASA DataMan into AWS S3 and a shared file system; jobs run both on the NCCS Science Cloud (internal cloud) and on CycleCloud-provisioned VMs with local data, under the HTCondor batch system.
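
The deck names HTCondor as the batch system but does not show a submit description. A minimal sketch, assuming a hypothetical process_subtile.py worker and a job count in the range the requirements slide cites:

```python
import subprocess
from pathlib import Path

N_SUBTILES = 1000   # hypothetical count; the deck cites "100s to a few 1,000" nodes

SUBMIT_FILE = """\
universe       = vanilla
executable     = process_subtile.py
arguments      = $(Process)
request_memory = 4500MB
output         = logs/subtile_$(Process).out
error          = logs/subtile_$(Process).err
log            = logs/cluster.log
queue {n}
"""

def submit_subtile_jobs():
    """Write an HTCondor submit description and queue one job per sub-tile."""
    Path("logs").mkdir(exist_ok=True)
    Path("subtiles.sub").write_text(SUBMIT_FILE.format(n=N_SUBTILES))
    subprocess.run(["condor_submit", "subtiles.sub"], check=True)

if __name__ == "__main__":
    submit_subtile_jobs()
```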

27 Data Flow. Situation: ongoing data collection; time-sensitive processing needed. CycleCloud-driven process:
1. Imagery downloaded, initial data reduction
2. Data auto-loaded to S3 storage
3. Job starts when a minimum amount of data is available (see the sketch below)
4. Results returned to local repository
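
For step 3, a minimal sketch (not from the deck) of polling the S3 staging bucket and starting the batch run once enough imagery has accumulated; the bucket name, threshold, and submit hook are assumptions:

```python
import time
import boto3  # assumes AWS credentials are configured in the environment

BUCKET = "nga-imagery-staging"        # hypothetical staging bucket
MIN_BYTES = 500 * 1024**3             # hypothetical threshold: 500 GB staged

def staged_bytes(bucket: str) -> int:
    """Total size of all objects currently staged in the bucket."""
    s3 = boto3.client("s3")
    total = 0
    for page in s3.get_paginator("list_objects_v2").paginate(Bucket=bucket):
        total += sum(obj["Size"] for obj in page.get("Contents", []))
    return total

def submit_batch_run():
    # Hypothetical hook: e.g. condor_submit the sub-tile jobs (see the previous sketch).
    pass

if __name__ == "__main__":
    while staged_bytes(BUCKET) < MIN_BYTES:
        time.sleep(300)               # re-check every 5 minutes
    submit_batch_run()
```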

28 AWS Cluster Requirements
Category | Description | Requirement
Number of Cores | How many cores are required on a single node for the application? | 1 per sub-tile
Amount of Memory (RAM) | How much memory on a node (or per core) is required for the application? | Slightly more than 4 GB per sub-tile
Operating System (O/S) | What operating system does the application need? | CentOS
Libraries/Tools/Software | What additional libraries, tools, and software need to be installed? Compilers? Commercial software? | None; code written in Python
Parallelization | Can the application run in a parallel manner? If so, how (threaded, MPI, or multiple instances of the application)? If it runs in parallel across many nodes, how many nodes are required? | Inherently parallel processing of each scene and/or tile; 100s to a few 1,000 nodes
Cluster Storage | How much storage space will be required for each run (input, intermediate, and output files)? | Total input 8 TB; total output back to NCCS 2 TB (approx. 25% of total input)
Shared Storage | Does this storage have to be shared across all nodes? | Using S3 to move data to local VM storage; S3 used to store output (see the staging sketch below)
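
The shared-storage row says S3 moves inputs to local VM storage and holds the outputs; a minimal sketch of that staging pattern with boto3 (bucket, key, and path names are hypothetical):

```python
import boto3  # assumes AWS credentials are configured on the VM

s3 = boto3.client("s3")

def stage_in(bucket: str, key: str, local_path: str) -> None:
    """Copy one input scene from S3 to the VM's local disk before processing."""
    s3.download_file(bucket, key, local_path)

def stage_out(local_path: str, bucket: str, key: str) -> None:
    """Push the processed result back to S3 for return to NCCS."""
    s3.upload_file(local_path, bucket, key)

if __name__ == "__main__":
    stage_in("nga-imagery-staging", "zone28/tile_042.tif", "/scratch/tile_042.tif")
    # ... run the Python biomass-estimation code against /scratch/tile_042.tif ...
    stage_out("/scratch/tile_042_biomass.tif", "nga-results", "zone28/tile_042_biomass.tif")
```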

29 Test Runs in the AWS Spot Market
- Input data: ~8 TB; output data: 2-3 TB (total output estimated to be ~25% of the input data)
AWS Spot Market:
- Markets where the price of compute changes based on supply and demand
- Reduce costs by 50% to 90% versus on-demand instances
- You'll never pay more than your bid; when the market price exceeds your bid, you get 2 minutes to wrap up your work (see the termination-notice sketch below)
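
Analogous to the GCP preemption sketch earlier, a spot-market worker can poll EC2 instance metadata for the two-minute termination notice. A minimal sketch, assuming IMDSv1-style metadata access (IMDSv2 would also need a session token) and a hypothetical checkpoint helper:

```python
import time
import urllib.error
import urllib.request

# EC2 returns HTTP 404 here until a spot termination notice has been issued.
SPOT_ACTION_URL = "http://169.254.169.254/latest/meta-data/spot/instance-action"

def termination_notice_issued() -> bool:
    try:
        with urllib.request.urlopen(SPOT_ACTION_URL, timeout=2):
            return True
    except urllib.error.URLError:
        return False

def checkpoint_and_exit():
    # Hypothetical: push partial results to S3 so the job can resume elsewhere.
    pass

if __name__ == "__main__":
    while not termination_notice_issued():
        time.sleep(5)                 # poll well inside the 2-minute window
    checkpoint_and_exit()
```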

30 Cost
- The entire test run: $80
- Can do an entire UTM zone for ~$250, all 11 for ~$2,750
- Cost for all 11 UTM zones & 4 satellites: ~$11,000
- Storage cost?

31 Make This a Service Within the NCCS: the ADAPT High Performance Science Cloud takes big data in and sends results out, bursting to commercial clouds (AWS, MS Azure, others). Bursting gives NASA the agility and flexibility to get more science done than is possible on their private cloud.

32 The Growth of Climate Data

33 The Growth of Climate Data: This is just for sub-Saharan Africa! With sufficient high-resolution remote imagery of the Earth, it would be possible to calculate the entire woody biomass of the Earth. That's a very big data problem.

34 The Future of Big Data and HPC at Exascale
- ADAPT Virtual Environment (analytics intensive): HPC and cloud, ~1,000 cores, 5 PB of storage, designed for big data analytics
- Mass Storage: tiered storage, disk and tape, 45 PB of storage, designed for long-term storage and recall, not compute
- Discover HPC Cluster (computationally intensive): 80,000 cores, 33 PB of storage, designed for large-scale climate simulations
- Future Exascale Environment: merging of HPC and big data analytics capabilities; ability for in-situ analytics throughout the environment (known analytics and machine learning)

35 And there are many others

36 Why NOT the Cloud for Big Compute?

37 Why NOT the Cloud for Big Compute? Just a few reasons we hear:
- I already own infrastructure.
- Cost
- Security
- Data inertia
- MPI/BigData/GPU/etc. workflows
- Too complex

38 Why the Cloud for Big Compute? Scientist / Engineer perspective, SysArchitect perspective, Organizational perspective

39 Why the Cloud for Big Compute? Individual user perspective:
- Highly skilled scientists/quants/designers/etc. doing science, not IT
- Remove wait times; capacity in minutes
- Simplified access to resources

40 Why the Cloud for Big Compute? SysAdmin perspective:
- Programmatic workflows linking cloud to internal processes
- Highly skilled staff no longer swapping hard drives or pulling cables
- User / cloud service management
- Data management

41 Why the Cloud for Big Compute? Organizational perspective:
- Security
- Ability to match budgets to computational consumption
- On-demand capacity as OpEx: take advantage of the latest tech
- Remove the burden of datacenter optimization

42 The Time is Now! So how do I get there?

43 Making Clouds productive is hard

44 Cycle Makes Clouds Productive [diagram labels: Internal, User, Data, Workflow, Rapid Test, At Scale Compute, Burst Capacity, Scaling]

45 Complete Cloud Workflow Control (CycleCloud):
- Provisioning: on-demand, spot pricing, multi-provider
- Configuration: pre-set or user-defined cluster types; workflow-driven setup
- Monitoring: auto-scaling (see the sketch below), job tracking, error handling
- Teardown
- Reporting: usage tracking, auditing
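
CycleCloud's own API is not shown in the deck, so purely as an illustration of the auto-scaling bullet above, here is a generic sizing loop that grows or shrinks a cluster from scheduler queue depth; every function it calls is a hypothetical placeholder:

```python
import time

CORES_PER_NODE = 16      # hypothetical instance shape
MAX_NODES = 3200         # hypothetical budget/quota ceiling

def queued_core_demand() -> int:
    """Hypothetical: ask the scheduler (UGE/HTCondor) how many cores the queue needs."""
    raise NotImplementedError

def current_node_count() -> int:
    """Hypothetical: ask the cloud provider how many worker nodes are running."""
    raise NotImplementedError

def set_node_count(target: int) -> None:
    """Hypothetical: request that the cluster grow or shrink to `target` nodes."""
    raise NotImplementedError

def autoscale_once() -> None:
    demand_nodes = -(-queued_core_demand() // CORES_PER_NODE)   # ceiling division
    target = min(demand_nodes, MAX_NODES)
    if target != current_node_count():
        set_node_count(target)   # scale up for backlog, down (toward teardown) when idle

if __name__ == "__main__":
    while True:
        autoscale_once()
        time.sleep(60)
```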

46 BETTER ANSWERS, FASTER.