Optimal Infrastructure for Big Data


1 Optimal Infrastructure for Big Data
Big Data 2014: Managing Government Information
Kevin Leong, January 22, 2014
© 2014 VMware, Inc. All rights reserved.

2 The Right Big Data Tools for the Right Job
- Real-time streams (social, sensors)
- Machine learning (Mahout, etc.)
- Real-time processing (S4, Storm, Spark)
- Data visualization (Excel, Tableau)
- ETL (Informatica, Talend)
- Real-time databases (HBase, Cassandra, Shark, GemFire)
- Interactive analytics (HAWQ, Impala, Aster Data, Netezza)
- Batch processing (Hive, MapReduce)
- Structured and unstructured data (HDFS, MapR)
All running on cloud infrastructure: compute, storage, networking.

3 An Infrastructure for Hadoop and Friends
Compute layer: Hadoop production, Hadoop test/dev, SQL on Hadoop (HAWQ, Impala, Drill), NoSQL (Cassandra, Mongo), and other engines (Spark, Shark, Solr, Platfora).
Data layer: HDFS, HBase.
Beneath both sits a distributed resource-management OS and file system spanning many hosts.

4 Customers' Big Data Journey: Getting from Here to There
Stage 1: Hadoop piloting (standalone). Often starts within a line of business; try one or two use cases to explore the value of Hadoop.
Stage 2: Hadoop production. Serves a few departments; more use cases; growing number and size of clusters; core Hadoop plus components.
Stage 3: Cloud analytics platform (integrated). Serves many departments; often part of mission-critical workflows; fully integrated with analytics/BI tools.
Scale grows along the journey, from single nodes to 10s and 100s of nodes.

5 Operational Simplicity

6 What Users Want
Data scientists, analysts, developers, and line-of-business users are intimate with data and analysis, not IT, and are tasked with providing actionable intelligence that impacts key metrics.
Requirements: obtain a Hadoop cluster on demand; minimize time to insight; get reasonable performance from the Hadoop cluster.

7 Nick's Challenges
"I don't want to be the bottleneck when it comes to provisioning Hadoop clusters."
"I want to better manage the jumble of departmental Hadoop clusters in my organization."
"I need sizing flexibility, because my Hadoop users don't know how large a cluster they need."
"I don't really know that much about Hadoop."
"I want to establish a repeatable process for deploying Hadoop clusters."

8 Distilling Challenges into Key Themes
- Automation of the cluster lifecycle
- Complexity of running and tuning Hadoop clusters
- Shortage of skills
- Keeping big data clusters manageable
- Flexibility in sizing clusters
- Keeping up with the demands of the business

9 Keep It Simple
Operational simplicity spans both infrastructure and software, from initial deployment through day-2 operations and scale: provision infrastructure, install software, tune configuration, customize, and execute jobs.

10 Flexibility in Cluster Creation: variable number of nodes, variable node size, and the ability to resize clusters.

11 Self-service Big Data Clusters
Limit IT involvement in deploying big data clusters: IT just wants to deliver infrastructure, not applications. In other words, Hadoop-as-a-Service.

12 Platform Efficiency

13 Adult Supervision Required: Clusters over 10 Nodes

14 Challenges of Running Hadoop in the Enterprise
Dept A (recommendation engine): production, test, and experimentation clusters.
Dept B (ad targeting): production, test, and experimentation clusters.
On the horizon: NoSQL and real-time SQL.
Data scattered across clusters: log files, transaction data, social data, historical customer behavior.
Pain points:
1. Cluster sprawl
2. Redundant common data in separate clusters
3. Inefficient use of resources: some clusters may run at capacity while others sit idle

15 Now Nick's Job Is to Drive Efficiencies
"I want to scale out when my workload requires it."
"I want to get all Hadoop clusters into a centralized environment to minimize spend."
"My Hadoop users ask for large Hadoop clusters, which end up underutilized."
"I want to offer Hadoop-as-a-Service in my private cloud."

16 A More Ideal Situation
One physical infrastructure supports multiple logical big data clusters: production (recommendation engine, ad targeting), test/dev, and experimentation.

17 Containers with Isolation Are a Tried and Tested Approach
Hungry workload 1, reckless workload 2, and nosy workload 3 run side by side on a distributed resource-management OS and file system spanning many hosts.

18 Mixing Workloads: Three Types of Isolation Are Required
- Resource isolation: control the greedy, noisy neighbor; reserve resources to meet needs.
- Version isolation: allow concurrent OS, application, and distro versions.
- Security isolation: provide privacy between users and groups; runtime and data privacy are required.

19 Dynamic Hadoop Scaling
Deploy separate compute clusters for different tenants sharing HDFS. Commission and decommission compute nodes according to priority and available resources.
Compute layer: per-tenant JobTrackers and compute VMs drawn from a dynamic resource pool (experimentation, production, production recommendation engine).
Data layer: shared HDFS on the virtualization layer.
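The commission/decommission idea on this slide can be sketched as a simple priority-driven allocator. Everything below (tenant names, the `rebalance` helper, the pool size) is a hypothetical illustration, not VMware's actual scheduler:

```python
# Sketch: allocate compute VMs from a fixed pool to tenants that share
# one HDFS data layer. Higher-priority tenants are funded first; lower-
# priority tenants are squeezed when the pool is tight.

def rebalance(pool_size, tenants):
    """Grant compute VMs to tenants in priority order.

    tenants: list of dicts with 'name', 'priority' (higher wins),
    and 'demand' (VMs requested). Returns {name: VMs granted}.
    """
    remaining = pool_size
    grants = {}
    for t in sorted(tenants, key=lambda t: t["priority"], reverse=True):
        grant = min(t["demand"], remaining)
        grants[t["name"]] = grant
        remaining -= grant
    return grants

tenants = [
    {"name": "production-recommendation", "priority": 3, "demand": 4},
    {"name": "production-ad-targeting", "priority": 2, "demand": 3},
    {"name": "experimentation", "priority": 1, "demand": 4},
]

# With an 8-VM pool, experimentation shrinks to the single leftover VM;
# growing the pool would let it expand again without touching HDFS.
print(rebalance(8, tenants))
```

Because the data layer is shared, shrinking a tenant here only decommissions compute VMs; no HDFS blocks need to be re-replicated.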

20 Result: Increased Utilization
A consolidated cluster (Hadoop 1, Hadoop 2, HBase) has access to the entire pool of physical resources. Take advantage of multi-tenancy to increase utilization during off-peak hours, and reduce latency for priority jobs on the consolidated cluster.

21 A Generalized Platform for Big Data
Compute layer: Hadoop production, Hadoop test/dev, SQL on Hadoop (HAWQ, Impala, Drill), NoSQL (Cassandra, Mongo), and other engines (Spark, Shark, Solr, Platfora).
Data layer: HDFS, HBase.
All on a distributed resource-management OS and file system spanning many hosts.

22 Integration into Existing Environment

23 Practical Considerations Necessitate a Flexible Platform
"I want to use my existing infrastructure, not buy new hardware."
"I want a low-risk way of trying Hadoop."
"I want to leverage the tools I already have."
"My data is in shared storage; do I have to move it?"
"Hadoop on public cloud is costing too much."

24 Big Data Deployment Options: Bare Metal, Public Cloud, Virtual
How can I be responsive to business needs? Increase utilization and control costs? Leverage existing hardware and tools? Reduce the admin burden? Maintain control of data?
Do I have multiple clusters in my environment? Short-lived clusters? Clusters that run 24x7? Predictable workloads? Performance-sensitive workloads?

25 Use Storage That Meets Your Needs
SAN storage: $2-$10/GB. $1M gets 0.5 petabytes, 200,000 IOPS, 8 GB/sec.
NAS filers: $1-$5/GB. $1M gets 1 petabyte, 200,000 IOPS, 10 GB/sec.
Local storage: $0.05/GB. $1M gets 10 petabytes, 400,000 IOPS, 250 GB/sec.
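The capacity figures follow from the per-gigabyte prices; a quick sanity check is below. The `overhead_factor` is an assumption of mine (e.g. mirroring or other effective-capacity overhead) to reconcile raw arithmetic at $0.05/GB with the slide's 10 PB figure, since $1M at $0.05/GB buys 20 PB raw:

```python
# Capacity-per-budget arithmetic behind the storage comparison.
# Prices are the slide's low-end $/GB numbers; overhead_factor is an
# assumed raw-to-usable ratio, not something the slide states.

BUDGET = 1_000_000     # dollars
GB_PER_PB = 1_000_000  # decimal petabyte, as storage vendors quote it

def petabytes_for_budget(price_per_gb, budget=BUDGET, overhead_factor=1.0):
    """Usable petabytes a budget buys at a given $/GB price."""
    return budget / price_per_gb / overhead_factor / GB_PER_PB

print(petabytes_for_budget(2.0))   # SAN at its $2/GB low end -> 0.5 PB
print(petabytes_for_budget(1.0))   # NAS at its $1/GB low end -> 1.0 PB
print(petabytes_for_budget(0.05, overhead_factor=2.0))  # local -> 10.0 PB
```

The bandwidth and IOPS numbers are not derivable this way; they come from the slide directly.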

26 Hybrid Storage Model to Get the Best of Both Worlds
Master nodes (NameNode, JobTracker) on shared storage: leverage VM movement and high-availability technologies.
Slave nodes (TaskTracker, DataNode) on local storage: lower cost, scalable bandwidth.
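The hybrid model amounts to a placement policy: HA-sensitive master daemons go to shared storage, bandwidth-hungry slave daemons to local disks. A minimal sketch of that policy, using classic Hadoop 1.x role names (the function and constants are hypothetical, for illustration only):

```python
# Hypothetical VM placement policy for the hybrid storage model:
# masters on shared datastores (so VM migration/HA restart works),
# slaves on direct-attached disks (cheap, bandwidth scales with hosts).

SHARED = "shared"  # SAN/NAS datastore
LOCAL = "local"    # direct-attached local disks

PLACEMENT = {
    "NameNode": SHARED,
    "JobTracker": SHARED,
    "TaskTracker": LOCAL,
    "DataNode": LOCAL,
}

def datastore_for(role):
    """Return the storage tier a Hadoop daemon's VM should land on."""
    try:
        return PLACEMENT[role]
    except KeyError:
        raise ValueError(f"unknown Hadoop role: {role!r}")

print(datastore_for("NameNode"))  # shared
print(datastore_for("DataNode"))  # local
```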

27 Leveraging Isilon as External HDFS
Faster time to results: analyze data in place. Lower risk by using existing enterprise storage (Isilon). Scale storage and compute independently: an elastic compute layer over a data layer of Hadoop on Isilon.

28 Choose Your Own Adventure

29 Choose the Right Tool for the Job

30 Conclusion

31 Big Data Infrastructure Considerations
- Operational simplicity: ease of use, ease of management
- Platform efficiency: utilization, cost
- Integration into the existing environment: flexibility, ability to grow with your needs

32 Hadoop on Virtualization
Compute layer: Hadoop production, Hadoop test/dev, SQL on Hadoop (HAWQ, Impala, Drill), NoSQL (Cassandra, Mongo), and other engines (Spark, Shark, Solr, Platfora).
Data layer: shared file system, HBase.
All on a virtualization layer spanning many hosts.

33 Thank You Kevin Leong