VIRT1400BU
Real-World Customer Architecture for Big Data on VMware vSphere
Joe Bruneau, General Mills
Justin Murray, Technical Marketing, VMware
#VMworld #VIRT1400BU

Disclaimer This presentation may contain product features that are currently under development. This overview of new technology represents no commitment from VMware to deliver these features in any generally available product. Features are subject to change, and must not be included in contracts, purchase orders, or sales agreements of any kind. Technical feasibility and market demand will affect final delivery. Pricing and packaging for any new technologies or features discussed or presented have not been determined. #VIRT1400BU CONFIDENTIAL 2

Agenda
1. Introductions
2. Why Enterprises are Deploying Big Data on vSphere, and How
3. Reference Architectures: an Overview
4. Introduction to General Mills
5. Architecture Details from the General Mills Hadoop Deployment
6. Experiences in Deployment: Hurdles and How to Overcome Them
7. Best Practices Discovered Along the Way
8. Lessons Learned
9. Future Work
10. Conclusions

Our Roles
Joe is a Systems Administrator at General Mills and has worked with VMware for over a decade. He started out deploying VDI and Lab Manager and then moved on to virtualizing servers with the Windows team. After that he joined the enterprise infrastructure team to build the VMware landscape for the SAP migration from HP-UX Superdomes to RHEL, implementing SRM and introducing Fault Tolerance. Joe also supports the Fibre Channel infrastructure and works on backup and DR.
Justin is a senior Technical Marketing architect at VMware. He works closely with the company's customers and partners to help them deploy big data and analytics application systems on VMware vSphere. He writes best practice documents and other material on these subjects and has spoken at public events on these technical topics.

Why are Enterprises Deploying Big Data?
- They want to get off existing costly data platforms (for OLAP rather than OLTP)
- Older data warehouse technology is not serving their needs
- Want to do queries and analytics against many different forms of data (structured, unstructured, streaming, images, clickstream)
- Provide data access to their own customers with analytic tools (e.g. VMware telemetry data)
- Integrate systems that have been islands till now
- Single source of truth for the enterprise
- Exploit new application architectures for developer productivity
- Want to do data science, machine learning, deep learning to predict their customer behaviors or detect fraud at time of occurrence (as examples)

Use Cases: Virtualization of Big Data
- Enterprises have development, test, pre-prod staging and production clusters that are required to be separated from each other and provisioned independently
- Organizations need different versions of Hadoop/Spark/machine learning platforms to be available to different teams, with possibly different services available (e.g. HBase, MapReduce, Spark)
- Enterprises do not wish to dedicate a specific set of hardware to each different requirement above, and want to reduce overall costs
- IT wants to provide Hadoop clusters as a service on demand for its end users

The Existing Hadoop Architecture
[Diagram: a client submits a job to the ResourceManager (master scheduler); the NameNode holds the master file system index. Worker Nodes 1 to 3 each run a NodeManager and a DataNode. The NodeManagers host AppMaster 1, Container 2 and Container 3; the DataNodes hold HDFS Blocks 1 to 3.]

Hadoop in Virtual Machines
[Diagram; key: each box is a virtual machine. A job and its input file go to the ResourceManager (master scheduler) and the NameNode (master file system index). Worker Nodes 1 to 3 each run a NodeManager and a DataNode. The NodeManagers host AppMaster 1, Container 2 and Container 3; the DataNodes hold HDFS Blocks 1 to 3.]

High-Level View of Apache Spark

Deployment on vSphere: Proven Architectures
VMworld 2017 Content: Not for publication

Combined Model: Two Virtual Machines on a Host
[Diagram: one virtualization host server runs two Hadoop node virtual machines (Hadoop Node 1 and Hadoop Node 2). Each virtual machine runs a NodeManager and a DataNode on Ext4 file systems held in VMDKs, with six local DAS disks per virtual machine.]
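As an illustration of the six-disks-per-VM layout in the combined model, the sketch below builds the directory lists a DataNode and NodeManager would typically be given, one directory per Ext4 mount. The property names dfs.datanode.data.dir and yarn.nodemanager.local-dirs are standard Hadoop settings; the mount points (/data1 to /data6) and subdirectory names are hypothetical and not taken from the talk.

```python
# Sketch: spread HDFS and YARN working directories across the six
# Ext4-formatted disks that each Hadoop VM sees in the combined model.
# Mount points and subdirectories are illustrative assumptions.

def data_dir_properties(disks_per_vm=6, mount_prefix="/data"):
    """Return Hadoop directory properties, one entry per local disk."""
    mounts = [f"{mount_prefix}{i}" for i in range(1, disks_per_vm + 1)]
    return {
        # DataNode block storage, one directory per disk
        "dfs.datanode.data.dir": ",".join(f"{m}/dfs/dn" for m in mounts),
        # NodeManager scratch space, also one directory per disk
        "yarn.nodemanager.local-dirs": ",".join(f"{m}/yarn/nm" for m in mounts),
    }


props = data_dir_properties()
for name, value in props.items():
    print(f"{name}={value}")
```

Listing every disk in both properties lets HDFS and YARN spread I/O across all six spindles, which is the usual reason for presenting the disks to the VM individually.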

#1 Reference Architecture from Cloudera

Data/Compute Separation (with External Access to HDFS)
[Diagram: a virtualization host runs Hadoop Virtual Node 1 (ResourceManager) and Hadoop Virtual Nodes 2 and 3 (NodeManagers). Each virtual node has an OS image VMDK, and the NodeManager nodes have Ext4 temp space on VMDKs. HDFS requests go off-host to Isilon data nodes, each labeled NN.]

Key Requirements for Big Data Architecture
- Performance
- Scaling to dozens or hundreds of nodes (VMs)
- Robustness: distributed file system, no one process is a single point of failure
- High Availability
- Fault Tolerance
- Capable of handling new workloads with new compute demands

A Customer Journey: General Mills
Joe Bruneau

General Mills
- 1886: started as a flour mill on the banks of the Mississippi River as the Washburn Crosby Company
- 1928: General Mills is formed and starts trading on the NYSE
- 2001: acquired Pillsbury
- Over 100 locations internationally, 39,000 employees, number 165 on the Fortune 500
- Brands: Betty Crocker, Bisquick, Cheerios, Chex, Fiber One, Hamburger Helper, Nature Valley, Pillsbury, Yoplait, Gardetto's, Gold Medal Flour, Häagen-Dazs, Old El Paso, Progresso, Cascadian Farm, Larabar, Muir Glen, Epic Provisions, Totino's

Big Data / Connected Data at General Mills
- Started looking at big data 2½ years ago
- Marketing and Global Consumer Insights
- Understanding consumer trends
- Understanding the impact of marketing and promotions on retail sales
- Attended Virtualizing Big Data sessions at VMworld 2015
- Worked with our TAM to set up a meeting with Justin

Server Landscape
- Physical landscape: 30-node production cluster, 700 TB
- Virtual landscape: 15 physical hosts (30 VMs), 60 TB
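A quick back-of-the-envelope reading of the virtual landscape figures is sketched below. The host, VM and capacity numbers come from the slide. Whether the 60 TB is raw or usable is not stated, so the last calculation is only an assumption: it treats 60 TB as raw capacity and applies the HDFS default replication factor of 3.

```python
# Figures from the Server Landscape slide (virtual landscape)
physical_hosts = 15
virtual_machines = 30
cluster_capacity_tb = 60

# Assumption, not from the talk: capacity is raw and HDFS keeps its
# default of three replicas per block.
ASSUMED_REPLICATION = 3

vms_per_host = virtual_machines // physical_hosts          # 2, matching the combined model
capacity_per_vm_tb = cluster_capacity_tb / virtual_machines
usable_tb_if_raw = cluster_capacity_tb / ASSUMED_REPLICATION

print(f"VMs per host: {vms_per_host}")
print(f"Capacity per VM: {capacity_per_vm_tb} TB")
print(f"Usable capacity if 60 TB is raw: {usable_tb_if_raw} TB")
```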

Storage Benchmarks: Physical vs. Virtual
Internal storage benchmarking tool

TeraGen Benchmarking

TeraSort Benchmarking

Hadoop on vSphere

CPU, Memory & Disk Configuration

Datastores and Devices

Applications
- Impala for interactive analysis using tools like Tableau
- Hive & Spark for batch processing
- Limited machine learning using Python and R

Business Applications That Hadoop is Used For
- Data sources are across the enterprise, from master data to transactional data, web activity analytics, supply chain, IoT, manufacturing analytics, consumer sentiment
- We see the data lake as an enabler for enterprise-wide analytics

General Mills Big Data
- Cloudera Hadoop
- Tableau
- SAP BW on HANA
- SAP Data Services
- Data catalog / Data quality tools

Cloudera
- Superior management tools
- Cloudera Navigator

Initial Build
- Once the engineering technical specs were worked out and the hardware configured, it took less than a day to build the VM infrastructure because we leveraged our existing VM deployment process
- It took about a year to configure and implement everything, as we were working with a third-party vendor as part of the solution

Key Factors
- Cost
- Performance
- Using a hardware RAID controller instead of a software RAID controller
- Leverage existing virtual machine provisioning and deployment process
- Improve drive failure detection
- Ability to implement hot spare hard drives

Best Practices
- Use a hardware RAID controller
- Stick to NUMA boundaries
- Proper memory sizing of VMs
- Don't over-allocate
- Good documentation for the Hadoop admins (where VMs are located)
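The NUMA and sizing practices above can be turned into a simple pre-deployment check. The sketch below is illustrative only: the Host and VM classes, the host specification (2 sockets, 10 cores each, 256 GB) and the 16 GB hypervisor reserve are hypothetical values, not the General Mills configuration, and it assumes one NUMA node per CPU socket.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Host:
    sockets: int                     # assumed: one NUMA node per socket
    cores_per_socket: int
    memory_gb: int
    hypervisor_reserve_gb: int = 16  # hypothetical headroom left for the hypervisor


@dataclass
class VM:
    name: str
    vcpus: int
    memory_gb: int


def check_placement(host: Host, vms: List[VM]) -> List[str]:
    """Return warnings for VMs that span NUMA nodes or over-allocate the host."""
    warnings = []
    node_memory_gb = host.memory_gb / host.sockets
    for vm in vms:
        if vm.vcpus > host.cores_per_socket:
            warnings.append(f"{vm.name}: {vm.vcpus} vCPUs do not fit in one NUMA node")
        if vm.memory_gb > node_memory_gb:
            warnings.append(f"{vm.name}: {vm.memory_gb} GB does not fit in one NUMA node")
    if sum(vm.vcpus for vm in vms) > host.sockets * host.cores_per_socket:
        warnings.append("host: vCPUs are over-allocated")
    if sum(vm.memory_gb for vm in vms) > host.memory_gb - host.hypervisor_reserve_gb:
        warnings.append("host: memory is over-allocated")
    return warnings


host = Host(sockets=2, cores_per_socket=10, memory_gb=256)
fits = check_placement(host, [VM("hadoop-node-1", 10, 112), VM("hadoop-node-2", 10, 112)])
too_big = check_placement(host, [VM("hadoop-node-1", 16, 200)])
print(fits)     # no warnings: each VM fits inside one socket
print(too_big)  # warnings: the VM spans both NUMA nodes
```

Two VMs per two-socket host, each sized to one socket, is the simplest way to satisfy both rules at once, which lines up with the two-VMs-per-host combined model shown earlier in the deck.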

How Would You Do it Differently, If at All, Today?
- Develop automation to deploy and map VMs to specific Hadoop hardware configurations
- Had time permitted, I would have spent more time researching hardware options to simplify deployment and create a Hadoop-as-a-Service infrastructure platform or distribution

What Does the Business Gain or Hope to Gain from the Big Data Systems? Who Are the End Users of This System?
- The business has a broader connected data strategy, and Hadoop is the foundation for the strategy
- Business analysts
- Automated systems

Future Plans
- What plans are there to expand the cluster(s) in the future? Build as needed
- How are the business needs/requests handled to increase the functionality of the system? Through the connected data steering team
- Any new technologies that you are interested in using? NVMe storage with virtual Hadoop; investigate the HP Synergy platform

How Does the Company Think about Big Data in the Private and Public Clouds as Cooperating with or Replacing Each Other?
- Long term, we see the two working together in a hybrid way
- Public cloud will help us:
  - Google: advanced analytics capabilities
  - Amazon: cloud-only analytics capabilities

Introducing vSphere Scale-Out for Big Data and HPC Workloads
New package that provides all the core features required for scale-out workloads at an attractive price point
- Features: Hypervisor, vMotion, vShield Endpoint, Storage vMotion, Storage APIs, Distributed Switch, I/O Controls & SR-IOV, Host Profiles / Auto Deploy, and more
- Packaging: sold in packs of 8 CPUs at a cost-effective price point
- Licensing: EULA enforced for use with Big Data/HPC workloads only
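Because the package is sold in packs of 8 CPUs, the pack count for a cluster is a ceiling division. The sketch below assumes each physical socket counts as one licensed CPU and uses a hypothetical 15-host, two-socket cluster; neither the socket count nor the counting rule is stated in the talk.

```python
import math

PACK_SIZE_CPUS = 8  # from the slide: sold in packs of 8 CPUs


def packs_needed(hosts: int, sockets_per_host: int) -> int:
    """Packs required, assuming one licensed CPU per physical socket."""
    total_cpus = hosts * sockets_per_host
    return math.ceil(total_cpus / PACK_SIZE_CPUS)


# Hypothetical example: 15 two-socket hosts need 30 CPU licenses,
# which rounds up to 4 packs.
print(packs_needed(hosts=15, sockets_per_host=2))
```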

Conclusions
- Virtualized Big Data is in production today at customer sites such as those of General Mills
- Big Data requires some best practices at both the business level and the technical level
- VMware can help you on this journey
jmurray@vmware.com
bigdata@vmware.com
http://www.vmware.com/big-data