Operational Resiliency for a Virtualized Environment

Peter Laz, MBCP, MBCI, Managing Consultant, Forsythe
Brendan Foye, Enterprise Account Manager, Zerto

AGENDA
- Operational Resiliency vs. Disaster Recovery
- Hypothetical Case Study
- Summary and Q&A

OPERATIONAL RESILIENCY VS. DISASTER RECOVERY

TRADITIONAL DR MODEL
- Minimum acceptable level of performance
- Invoke alternate procedures to recover and resume operations following a significant disruptive event
- Interruption in the following for a large percentage of apps and business functions:
  - IT service
  - Degraded IT capability
  - Operation/workflow
  - Limited customer service

OPERATIONAL RESILIENCY MODEL
- Optimum level of performance
- Architecture and processes for continuous availability of business operations and IT environments
- A much larger percentage of IT services and business functions experience:
  - Continuous availability of IT service
  - End-to-end process is business as usual (application interdependencies, no workarounds)
  - Full performance and capacity (IT and business functions)
  - No customer service impact

TRADITIONAL DR SOLUTION VIEW
Production data center:
- Full production (capacity and performance)
- Application HA within the data center only
- Metrics include capacity, performance and availability within the data center
DR site (internal or vendor provided):
- Used in response to single-site outage (out-of-region)
- Limited capacity and performance
- Primary metrics are RTO / RPO

OPERATIONAL RESILIENCY SOLUTION VIEW
Production Data Center and Production Data Center II (10-50 miles apart):
- Production applications at both sites, in-region
- Application HA capabilities within and between sites (converged HA & DR)
- Metrics include capacity, performance and availability applied to both application and site-level outage events
DR site (internal or vendor provided):
- Used in response to dual regional outage (out-of-region)
- Limited capacity and performance
- Primary metrics are RTO / RPO

CHALLENGE OF MOVING TO THE RESILIENT MODEL
Controlling the proliferation of technologies that arise to meet resiliency requirements is key, because they:
- Drive up cost
- Drive up risk
- Produce functional gaps

VIRTUAL CASE STUDY
- A mid-size enterprise with 100 VMs
- US and international data centers
- Legacy hardware, considering new/dissimilar flash array
- Lean IT team, one headcount dedicated to replication
- High standards for RPO, RTO, SLA
- Regular audits

ACME ANVIL CORPORATION
Sites:
- St. Louis HQ: Corporate Data Center
- Chicago: Co-lo DR site
- Denver: West Region Hub
- Raleigh: East Region Hub
- Remote offices

CURRENT
TIER   RTO                     RPO
1      24 hrs                  24 hrs
2      48 hrs                  24 hrs
3      [number missing] days   1 week

CHALLENGE: REDUCE COSTS
Fat data
- Space-consuming snapshots
- Not de-duped or compressed
- Replicating all VMs and VDRs, regardless of which you really care about (such as application groups)
Overseas operations
- Requires big pipe
- Adds expense, complexity
Duplicated hardware
- Back-up array always matches storage array
- Costs doubled without increased functionality

CHALLENGE: MANAGEMENT COMPLEXITY
- Valuable headcount assigned to DR
- Manually re-mapping storage arrays
- Additional time-consuming tasks mean the team spends more hours in DR than production, such as:
  - De-duping and compressing data
  - Overseas pipeline management
  - Managing refreshes
  - After-hours audit support
  - Restoring applications

CHALLENGE: MEET RPOS & RTOS
Event-based challenges
- Failback and recovery
- General DR tests
- Application tests/development
- Multi-site, multi-country data transfer
Functionality-based challenges
- Snapshot-only environment

FROM THE CTO
- More flexibility in hardware with less money
- Sick of spending time at DR site v. production site
- No DR downtime and no lost data
- Maintain SLAs & ensure RTOs and RPOs
- Implement all this yesterday
- Small learning curve
- Simple/short install that won't consume a lot of resources

ELMER'S MANDATE
TIER   RTO        RPO
1      < 4 hrs    < 30 mins
2      < 24 hrs   < 30 mins
3      48 hrs     24 hrs
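The gap between the case study's current tiers and the mandate can be checked mechanically. The following is a minimal sketch in Python, not part of the original deck: the names `Objective`, `MANDATE`, `CURRENT`, `meets` and `gaps` are hypothetical, the "<" targets are treated as strict limits, and tier 3 is left out of the current figures because its RTO is not legible in the source.

```python
from datetime import timedelta
from typing import NamedTuple


class Objective(NamedTuple):
    rto: timedelta
    rpo: timedelta
    strict: bool  # True when the slide states the target with "<"


# Targets taken from the mandate table above.
MANDATE = {
    1: Objective(timedelta(hours=4), timedelta(minutes=30), True),
    2: Objective(timedelta(hours=24), timedelta(minutes=30), True),
    3: Objective(timedelta(hours=48), timedelta(hours=24), False),
}

# Current (RTO, RPO) per tier from the case study; tier 3 omitted (illegible RTO).
CURRENT = {
    1: (timedelta(hours=24), timedelta(hours=24)),
    2: (timedelta(hours=48), timedelta(hours=24)),
}


def meets(tier: int, rto: timedelta, rpo: timedelta) -> bool:
    """Return True when the given RTO and RPO satisfy the tier's mandate."""
    target = MANDATE[tier]
    if target.strict:
        return rto < target.rto and rpo < target.rpo
    return rto <= target.rto and rpo <= target.rpo


def gaps(current: dict) -> list:
    """List the tiers whose current RTO/RPO miss the mandate."""
    return [tier for tier, (rto, rpo) in sorted(current.items())
            if not meets(tier, rto, rpo)]


print(gaps(CURRENT))
```

With the figures above, both legible tiers miss the mandate, which is the motivation for the solution slides that follow.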

SOLUTION: FLASH AND HYPERVISOR-BASED DR
- Not everything is virtualized or hypervisor based
- All primary applications and data need protection, including (especially) larger, non-x86 infrastructure
- One size does not fit all
- But complementary technologies will help mitigate risk:
  - Flash at the macro/array/volume/host group/protection group level
  - Zerto at the granular/VM/VM groups/cluster level

SOLUTION: DECREASE COSTS
- Streamline data with de-duping & compression
- Reduce workload on overseas pipeline
- Enable hardware-agnostic replication and DR (any to any)
- Test out adding new functionality to arrays
- Support multi-site, multi-country with minimal performance impact

SOLUTION: STREAMLINE COMPLEXITY
- Free up headcount from DR and replication
- Automatically re-map storage, no manual re-mapping
- Simplify other key time sucks:
  - Automatically de-dupe and compress data
  - Reduce need for overseas pipeline
  - Streamline refreshes
  - Audit support during business hours
  - Restoring applications in less time

SOLUTION: MEET OR EXCEED REQUIREMENTS
- Enable new capabilities in key workloads
- Full live failover and failback
- Small data v. fat data
- Refreshes take minutes instead of hours or days
- Enable near-sync continuous replication as well as snapshots
- Streamline restores with Point-in-Time Recovery journal
- Deliver RPOs in seconds, RTOs in minutes

HYPERVISOR-BASED DR IN ACTION

ZERTO VIRTUAL REPLICATION ARCHITECTURE
[Diagram: a vCenter Server at each of two sites]

ZERTO VIRTUAL REPLICATION ARCHITECTURE
Highly scalable
- Software only, hypervisor based, downloadable
- Replicate from anything to anything: save cost and reuse hardware
RPO = seconds, no app performance impact
- Near-sync, continuous replication
- Bandwidth optimization, WAN resiliency
Point-in-time recovery: recover from logical failures
- Journal-based any-point-in-time recovery
- No snapshots

APPLICATION PROTECTION: VIRTUAL PROTECTION GROUP
- Complete application protection and recovery
- Applications: SharePoint, CRM, ERP, Exchange, etc.
- Virtual Protection Group: VM & VMDK level consistency groups
- Protect across server and storage locations
- Fully support vMotion, Storage vMotion, HA, vApp
- Journal-based point-in-time protection
- Group policy and configuration
- VSS support
[Diagram label: replication site]
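The idea behind journal-based point-in-time recovery can be illustrated with a toy model. This is a generic sketch under stated assumptions, not Zerto's implementation or API: `WriteJournal`, `record` and `recover` are hypothetical names, and a "disk" is modeled as a dictionary of block number to bytes. Every write is appended to a time-ordered journal, and recovery replays the journal onto the base image up to the chosen moment, which is how a logical failure (for example, a corrupting write) can be undone without relying on periodic snapshots.

```python
from dataclasses import dataclass, field


@dataclass
class WriteJournal:
    """Toy write journal: base image plus time-ordered block writes."""
    base: dict = field(default_factory=dict)      # block number -> bytes
    entries: list = field(default_factory=list)   # (timestamp, block, data)

    def record(self, timestamp: float, block: int, data: bytes) -> None:
        """Append a write; entries must arrive in time order."""
        if self.entries and timestamp < self.entries[-1][0]:
            raise ValueError("writes must be journaled in time order")
        self.entries.append((timestamp, block, data))

    def recover(self, as_of: float) -> dict:
        """Rebuild the image as it looked at time `as_of` (inclusive)."""
        image = dict(self.base)
        for timestamp, block, data in self.entries:
            if timestamp > as_of:
                break  # later writes are ignored
            image[block] = data
        return image


journal = WriteJournal()
journal.record(1.0, 0, b"good")
journal.record(2.0, 1, b"data")
journal.record(3.0, 0, b"corrupt")  # the logical failure

# Recover to a moment just before the bad write.
print(journal.recover(as_of=2.5))
```

Because the journal keeps every write rather than periodic copies, the recovery point can be any timestamp in the retained history, not only the times at which a snapshot happened to be taken.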

AUTOMATION: FAILOVER, FAILBACK, RECOVERY
RTO = minutes!
- Fully automated failover and failback of multiple VMs with write-order fidelity, including parallel VM recovery, boot order, IP reconfiguration, test networks and more
Click-to-test, anytime
- Immediate, automated failover testing while protecting production, also to a previous point in time
Offsite cloning
- Clone entire app offsite for test & dev or backup

WORKFLOW AUTOMATION - END-USER ACCEPTED RECOVERY
- End-user acceptance testing with the ability to roll back a failover automatically
- Validates prior to production release of application
- Simply recover from logical failures
- Ability to automate the commit or failback event
- Reduce operational complexity with workflow
- Significantly reduces the time it takes to reverse a failover activity
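Boot order combined with parallel VM recovery can be pictured as staged startup: VMs in the same stage start concurrently, and a stage begins only after the previous one has finished. The sketch below is an illustration only, assuming a hypothetical `start_vm` callable supplied by the caller and a hypothetical `failover` helper; it does not reflect any vendor's API.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import groupby


def failover(vms, start_vm):
    """Start VMs stage by stage.

    vms: list of (name, boot_stage) tuples.
    start_vm: hypothetical callable that powers on one VM by name.
    Returns a list of (stage, sorted VM names) in the order stages ran.
    """
    completed = []
    ordered = sorted(vms, key=lambda vm: vm[1])
    with ThreadPoolExecutor() as pool:
        for stage, group in groupby(ordered, key=lambda vm: vm[1]):
            names = [name for name, _ in group]
            # list() waits for every VM in this stage before moving on.
            list(pool.map(start_vm, names))
            completed.append((stage, sorted(names)))
    return completed


log = []
plan = [("web", 2), ("db", 1), ("app", 2)]
result = failover(plan, log.append)  # log.append stands in for a real power-on
print(result)
```

Here the database tier (stage 1) is up before the application and web tiers (stage 2) start together, which is the kind of dependency a recovery plan typically encodes.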

BEFORE: COMPLEX, MANUAL REPLICATION PROCESS
Example - Current replication configuration process for virtualized CRM
Virtualization Team
- Local vCenter:
  - Locate all VMs affecting CRM
  - Locate all VM datastores
  - Map & document [illegible] all LUNs
  - Consolidate CRM VMs on separate LUN
- Remote vCenter:
  - Create and document recovery plan
  - Test recovery plan
  - Ongoing replication monitoring
Storage Team
- Storage:
  - Move all other app VMs to other LUNs
  - Consolidate all CRM VMs to same LUN
  - Document all LUN properties
- Replication management:
  - Configure all replication pairs and entities
  - Verify replication
  - Ongoing monitoring
- Remote storage:
  - Ensure sufficient space for replica
  - Allocate LUNs in replica with same properties
Callouts: Complex! Manual! Inflexible!

AFTER: AUTOMATED REPLICATION PROCESS
Example - Current replication configuration process for virtualized CRM
Virtualization Team
- Locate all VMs affecting CRM
- Configure and verify replication and policies
- Ensure space for replica
- Allocate space for all replicated VMs

SUMMARY
Business requirements:
- Driving demand toward the Operational Resiliency model, including Elmer's requirements:
  - Reduce costs, complexity, & resources
  - Assure service level capabilities
- Traditional DR
- HA within & between sites
- Metrics for capacity, performance & availability

QUESTION AND ANSWER
Peter Laz
Brendan Foye