Optimal Infrastructure for Big Data
1 Optimal Infrastructure for Big Data. Big Data 2014: Managing Government Information. Kevin Leong, January 22, 2014. © 2014 VMware Inc. All rights reserved.
2 The Right Big Data Tools for the Right Job: real-time streams (social, sensors); machine learning (Mahout, etc.); real-time processing (S4, Storm, Spark); data visualization (Excel, Tableau); ETL (Informatica, Talend); real-time databases (HBase, Cassandra, Shark, GemFire); interactive analytics (HAWQ, Impala, Aster Data, Netezza); Hive; batch processing (MapReduce); structured and unstructured data (HDFS, MapR). All of it sits on cloud infrastructure: compute, storage, and networking.
3 An Infrastructure for Hadoop and Friends. Compute layer: Hadoop production, Hadoop test/dev, SQL on Hadoop (HAWQ, Impala, Drill), NoSQL (Cassandra, Mongo), and other frameworks (Spark, Shark, Solr, Platfora). Data layer: HDFS, HBase. Everything runs on a distributed resource-management OS plus file system spanning the hosts.
4 Customers' Big Data Journey: Getting from Here to There (from standalone to integrated, scaling from tens to hundreds of nodes). Stage 1: Hadoop Piloting. Often starts with a line of business, trying one or two use cases to explore the value of Hadoop. Stage 2: Hadoop Production. Serves a few departments; more use cases; growing number and size of clusters; core Hadoop plus components. Stage 3: Cloud Analytics Platform. Serves many departments; often part of a mission-critical workflow; fully integrated with analytics/BI tools.
5 Operational Simplicity
6 What Users Want. Data scientists, analysts, developers, and line-of-business users are intimate with data and analysis, not IT, and are tasked with providing actionable intelligence that impacts key metrics. Their requirements: obtain a Hadoop cluster on demand, minimize time to insight, and get reasonable performance from the Hadoop cluster.
7 Nick's Challenges. "I don't want to be the bottleneck when it comes to provisioning Hadoop clusters. I want to better manage the jumble of departmental Hadoop clusters in my organization. I need sizing flexibility, because my Hadoop users don't know how large a cluster they need. I don't really know that much about Hadoop. I want to establish a repeatable process for deploying Hadoop clusters."
8 Distilling Challenges into Key Themes. Themes: automation of the cluster lifecycle; complexity of running and tuning Hadoop clusters; shortage of skills; keeping big data clusters manageable; flexibility in sizing clusters; keeping up with the demands of the business.
9 Keep It Simple. Operational simplicity spans both infrastructure and software, across initial deployment, day-2 operations, and scale: provision infrastructure, install software, tune configuration, customize, and execute jobs.
10 Flexibility in Cluster Creation: a variable number of nodes, variable node size, and the ability to resize.
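To make those three flexibility requirements concrete, here is a minimal sketch of a resizable cluster specification. The `ClusterSpec` and `NodeSize` classes, their fields, and the numbers are illustrative assumptions, not an API from the talk.

```python
from dataclasses import dataclass

@dataclass
class NodeSize:
    """Illustrative node shape: vCPUs, RAM, and local disk."""
    vcpus: int
    ram_gb: int
    disk_gb: int

@dataclass
class ClusterSpec:
    """A requested Hadoop cluster: variable node count and node size."""
    name: str
    node_count: int
    node_size: NodeSize

    def resize(self, new_count: int) -> int:
        """Grow or shrink the cluster; returns the change in node count."""
        delta = new_count - self.node_count
        self.node_count = new_count
        return delta

# A small pilot cluster that later grows on demand.
pilot = ClusterSpec("dept-a-pilot", 4, NodeSize(vcpus=4, ram_gb=32, disk_gb=2000))
delta = pilot.resize(10)   # scale out by 6 nodes
```

In a self-service model, a request like this is all a user submits; the platform turns it into provisioned infrastructure.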
11 Self-service Big Data Clusters. Limit IT involvement in deploying big data clusters; IT just wants to deliver infrastructure, not applications. In short: Hadoop-as-a-Service.
12 Platform Efficiency
13 Adult Supervision Required: clusters over 10 nodes.
14 Challenges of Running Hadoop in the Enterprise. Today: Dept A runs a recommendation engine in production; Dept B runs ad targeting in production; NoSQL and real-time SQL are on the horizon; separate test and experimentation clusters hold log files, transaction data, social data, and historical customer behavior. Pain points: 1. cluster sprawl; 2. redundant common data in separate clusters; 3. inefficient use of resources, since some clusters could be running at capacity while other clusters sit idle.
15 Now Nick's Job Is to Drive Efficiencies. "I want to scale out when my workload requires it. I want to get all Hadoop clusters into a centralized environment to minimize spend. My Hadoop users ask for large Hadoop clusters, which end up underutilized. I want to offer Hadoop-as-a-Service in my private cloud."
16 A More Ideal Situation: one physical infrastructure supporting multiple logical big data clusters, such as production (recommendation engine, ad targeting), test/dev, and experimentation.
17 Containers with Isolation Are a Tried and Tested Approach. A hungry workload, a reckless workload, and a nosy workload run side by side on a distributed resource-management OS plus file system spanning the hosts.
18 Mixing Workloads: Three Types of Isolation Are Required. Resource isolation: control the greedy, noisy neighbor; reserve resources to meet needs. Version isolation: allow concurrent OS, application, and distro versions. Security isolation: provide privacy between users/groups; both runtime and data privacy are required.
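The resource-isolation idea, reserving capacity per tenant and refusing the greedy neighbor, can be sketched in a few lines. The `ResourcePool` class, tenant names, and capacities are illustrative assumptions, not part of any real scheduler's API.

```python
class ResourcePool:
    """Toy per-tenant reservation ledger over a fixed capacity (in GB)."""

    def __init__(self, total_gb: int):
        self.total_gb = total_gb
        self.reserved = {}  # tenant name -> reserved GB

    def reserve(self, tenant: str, gb: int) -> bool:
        """Grant a reservation only if it fits in the unreserved capacity."""
        used = sum(self.reserved.values())
        if used + gb > self.total_gb:
            return False  # the greedy neighbor is refused, not throttled later
        self.reserved[tenant] = self.reserved.get(tenant, 0) + gb
        return True

pool = ResourcePool(total_gb=512)
ok_prod = pool.reserve("production", 384)          # fits within capacity
ok_greedy = pool.reserve("experimentation", 256)   # would overcommit: refused
```

Real resource managers layer preemption and shares on top of this, but the admission check is the core of "reserve resources to meet needs."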
19 Dynamic Hadoop Scaling. Deploy separate compute clusters for different tenants sharing HDFS; commission and decommission compute nodes according to priority and available resources. The compute layer holds per-tenant JobTrackers and compute VMs drawn from a dynamic resource pool (experimentation, production, production recommendation engine), with the data layer and virtualization layer underneath.
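The commission/decommission decision above can be sketched as a priority-ordered allocator over a shared node pool. The function, tenant names, priorities, and demands are illustrative assumptions, not logic from any particular product.

```python
def allocate_nodes(pool_size: int, tenants: list) -> dict:
    """Hand out compute nodes from a shared pool, highest priority first.

    tenants: list of {"name": str, "priority": int, "demand": int}.
    Returns a mapping of tenant name -> commissioned node count.
    """
    grants = {}
    remaining = pool_size
    for t in sorted(tenants, key=lambda t: t["priority"], reverse=True):
        granted = min(t["demand"], remaining)  # never exceed the pool
        grants[t["name"]] = granted
        remaining -= granted
    return grants

grants = allocate_nodes(10, [
    {"name": "experimentation", "priority": 1, "demand": 6},
    {"name": "prod-recommendation", "priority": 3, "demand": 5},
    {"name": "prod-ad-targeting", "priority": 2, "demand": 4},
])
# Production tenants are satisfied first; experimentation gets the remainder.
```

Rerunning this as priorities or demands change is the "dynamic" part: tenants whose grant shrinks decommission nodes, tenants whose grant grows commission them.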
20 Result: Increased Utilization. A consolidated cluster (Hadoop 1, Hadoop 2, HBase) has access to the entire pool of physical resources: take advantage of multi-tenancy to increase utilization during non-peak hours, and reduce latency on priority jobs on the consolidated cluster.
21 A Generalized Platform for Big Data. Compute layer: Hadoop production, Hadoop test/dev, SQL on Hadoop (HAWQ, Impala, Drill), NoSQL (Cassandra, Mongo), and other frameworks (Spark, Shark, Solr, Platfora). Data layer: HDFS, HBase. Everything runs on a distributed resource-management OS plus file system spanning the hosts.
22 Integration into Existing Environment
23 Practical Considerations Necessitate a Flexible Platform. "I want to use my existing infrastructure, not buy new hardware. I want a low-risk way of trying Hadoop. I want to leverage the tools I already have. My data is in shared storage; do I have to move it? Hadoop on the public cloud is costing too much."
24 Big Data Deployment Options: bare metal, public cloud, or virtual. How can I be responsive to business needs, increase utilization and control costs, leverage existing hardware and tools, reduce the admin burden, and maintain control of data? Do I have multiple clusters in my environment, short-lived clusters, clusters that run 24x7, predictable workloads, or performance-sensitive workloads?
25 Use Storage That Meets Your Needs. SAN storage: $2-$10/gigabyte; $1M gets 0.5 petabytes, 200,000 IOPS, 8 Gbytes/sec. NAS filers: $1-$5/gigabyte; $1M gets 1 petabyte, 200,000 IOPS, 10 Gbytes/sec. Local storage: $0.05/gigabyte; $1M gets 10 petabytes, 400,000 IOPS, 250 Gbytes/sec.
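The capacity figures follow from simple arithmetic: budget divided by unit price, in decimal units (1 PB = 10^6 GB). The helper below is a sanity check of that math, not a pricing tool; note that at $0.05/GB a $1M budget works out to 20 PB raw, so the slide's 10 PB figure for local storage is the more conservative number.

```python
def petabytes_for_budget(budget_usd: float, price_per_gb: float) -> float:
    """Raw capacity purchasable for a budget, in decimal petabytes."""
    return budget_usd / price_per_gb / 1e6

san = petabytes_for_budget(1_000_000, 2.00)     # low end of SAN pricing
nas = petabytes_for_budget(1_000_000, 1.00)     # low end of NAS pricing
local = petabytes_for_budget(1_000_000, 0.05)   # quoted local-disk price
```

The same function makes the bandwidth-per-dollar contrast easy to see: local disk is where the capacity and sequential throughput are, which is exactly why Hadoop was designed around it.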
26 Hybrid Storage Model to Get the Best of Both Worlds. Master nodes (NameNode, JobTracker) go on shared storage to leverage VM movement and high-availability technologies; slave nodes (TaskTracker, DataNode) go on local storage for lower cost and scalable bandwidth.
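The hybrid placement rule is just a mapping from Hadoop role to storage tier; here is a minimal sketch of it. The function and tier names are illustrative assumptions; the role names are the classic Hadoop 1.x daemons named on the slide.

```python
SHARED, LOCAL = "shared", "local"

def storage_tier(role: str) -> str:
    """Place master daemons on shared storage, worker daemons on local disk."""
    masters = {"NameNode", "JobTracker"}     # benefit from VM mobility and HA
    workers = {"DataNode", "TaskTracker"}    # benefit from cheap local bandwidth
    if role in masters:
        return SHARED
    if role in workers:
        return LOCAL
    raise ValueError(f"unknown Hadoop role: {role}")
```

A provisioning tool applying this rule would back master VMs with the SAN/NAS datastore and worker VMs with host-local disks.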
27 Leveraging Isilon as External HDFS. Faster time to results through analysis of data in place; lower risk by using existing enterprise storage (Isilon); scale storage and compute independently, with an elastic compute layer over an Isilon-backed data layer.
28 Choose Your Own Adventure
29 Choose the Right Tool for the Job
30 Conclusion
31 Big Data Infrastructure Considerations. Operational simplicity: ease of use, ease of management. Platform efficiency: utilization, cost. Integration into the existing environment: flexibility, and the ability to grow with your needs.
32 Hadoop on Virtualization. Compute layer: Hadoop production, Hadoop test/dev, SQL on Hadoop (HAWQ, Impala, Drill), NoSQL (Cassandra, Mongo), and other frameworks (Spark, Shark, Solr, Platfora). Data layer: shared file system, HBase. A virtualization layer spans the hosts.
33 Thank You Kevin Leong