Working with Nutanix Metro Availability – part 1
Nutanix provides several ways to protect workloads running on its platform. Depending on business requirements, the customer can choose among the available options. If an application requires RPO=0, the customer can use Nutanix Metro Availability (synchronous replication). If RPO=0 is not required, the customer can choose between Nutanix asynchronous replication and NearSync replication.
This blog series focuses on Nutanix Metro Availability (metro stretch clusters). Starting with Nutanix AOS 5.9, Nutanix Metro Availability is available on Microsoft Hyper-V 2016.
At a high level, Nutanix Metro Availability consists of two independent Nutanix and VMware vSphere clusters located in two datacenters. From the vCenter point of view, however, they appear as one logical compute (vSphere) and storage (Nutanix) cluster.
Nutanix Metro Availability requirements:
- Connectivity latency between datacenters must be below 5ms RTT.
- VMware vSphere or Microsoft Hyper-V 2016 as Hypervisor
NOTE: with the current AOS release (AOS 5.9.1), Metro Availability Witness is supported with VMware vSphere only
- Stretched L2 network across the two datacenters
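The latency requirement can be checked against measured round-trip times before you build the metro pair. The snippet below is a minimal sketch: the sample values are made up, and judging the link by its worst sample rather than its average is my own conservative assumption, not a Nutanix rule.

```python
MAX_RTT_MS = 5.0  # limit quoted in the requirements list above


def rtt_within_limit(samples_ms, limit_ms=MAX_RTT_MS):
    """Return True when every measured RTT sample is below the limit."""
    if not samples_ms:
        raise ValueError("need at least one RTT sample")
    # Assumption: the worst sample decides, which is stricter than an average.
    return max(samples_ms) < limit_ms


# Hypothetical ping results between the two datacenters, in milliseconds.
healthy_link = [1.8, 2.1, 2.4, 1.9]
spiky_link = healthy_link + [6.3]

print(rtt_within_limit(healthy_link))
print(rtt_within_limit(spiky_link))
```

Feed it the output of whatever measurement tool you already use between the sites.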
You may ask: where is vCenter or Prism Central? The good news is that neither of them is a requirement for Nutanix MA to work properly. In the event of an unplanned failover, Cerebro (the service responsible for data replication) registers the VMs (including the vCenter and Prism Central VMs) on the surviving site directly on the ESXi hosts. You do not have to worry about having the management plane available during an unplanned failover.
How is data written in Nutanix Metro Availability?
When a write comes from a VM to Nutanix AOS, AOS writes two copies of the data in the Nutanix cluster on SITE A and immediately afterwards sends the same data to the Nutanix cluster on SITE B. Nutanix AOS on SITE B writes two copies of the data and sends an acknowledgment back to the Nutanix cluster on SITE A. The cluster on SITE A acknowledges the write to the VM only when all 4 copies have been written to the Metro Availability clusters (2 copies per cluster).
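The order of operations in the write path can be illustrated with a small simulation. This is a toy model of the sequence described above, not Nutanix code; the `Cluster` class and `metro_write` function are hypothetical names used only for illustration.

```python
class Cluster:
    """Toy stand-in for one Nutanix cluster that keeps two local copies."""

    def __init__(self, name, copies=2):
        self.name = name
        self.copies = copies
        self.blocks = []  # (block_id, replica_index) pairs

    def write_local(self, block_id):
        # Store every local replica, then report how many were written.
        for replica in range(self.copies):
            self.blocks.append((block_id, replica))
        return self.copies


def metro_write(block_id, active, standby):
    """Return the number of copies that exist when the VM gets its ack."""
    written = active.write_local(block_id)    # two copies on SITE A
    written += standby.write_local(block_id)  # SITE B writes two copies, then acks SITE A
    # Only at this point does SITE A acknowledge the write to the VM.
    return written


site_a = Cluster("SITE A")
site_b = Cluster("SITE B")
acked_copies = metro_write("blk-1", site_a, site_b)
print(acked_copies)
```

The point of the sketch is the ordering: the VM never sees an acknowledgment while fewer than four copies exist.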
Nutanix MA solution capabilities
- You can run active workloads on both sites
- You can vMotion virtual machines with zero downtime across both sites
- You can perform planned failover with zero downtime for virtual workloads
- You can set up replication to a 3rd site for workloads protected by Nutanix MA
- With Nutanix Witness, failover between sites can be fully automated
Failure Handling modes
Nutanix Metro Availability has 3 failure handling modes:
With failure handling set to manual, VM writes to the active container do not resume until either network connectivity is restored between the clusters or an administrator manually disables Metro Availability. The benefit of this setting is that it enforces strict synchronous replication between the sites. The downside is that applications within the VMs can time out while writes are held.
With failure handling set to automatic restore, if network connectivity is restored within the timeout period (10 seconds by default), the replication state immediately returns to enabled and writes continue to replicate synchronously between the clusters. If the timeout period expires, Metro Availability is disabled automatically and writes resume, but they only propagate to the active container. When Metro Availability has been disabled, replication to the standby container does not occur until both network connectivity is restored and you manually perform a reenable operation.
With failure handling set to witness, if network connectivity is restored within the timeout period (10 seconds by default), the replication state immediately returns to enabled and writes continue to replicate synchronously between the clusters. If the timeout period expires, the cluster that owns the active protection domain attempts to acquire the witness lock. If the active protection domain successfully acquires the lock, Metro Availability is disabled automatically and writes resume, but they only propagate to the active container. During this time, the cluster that owns the standby protection domain also detects the network interruption. After 20 seconds (by default), it attempts to obtain the witness lock. As the lock was already acquired by the active site, the standby site fails to obtain it and becomes inactive. When witness coordination has automatically disabled Metro Availability as described above, replication to the standby container stops until both network connectivity is restored and you manually perform a reenable operation.
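The witness sequence is easier to follow as a small simulation. The sketch below models only the case described above (a network partition with both sites still running) and uses the default timers from the text; `WitnessLock` and `witness_outcome` are hypothetical names, not part of any Nutanix API.

```python
class WitnessLock:
    """First caller wins; later callers from another site are refused."""

    def __init__(self):
        self.owner = None

    def acquire(self, site):
        if self.owner is None:
            self.owner = site
        return self.owner == site


def witness_outcome(outage_s, active_timeout_s=10, standby_timeout_s=20):
    """Return (active_state, standby_state) for a partition lasting outage_s."""
    if outage_s < active_timeout_s:
        # Link came back in time: synchronous replication simply continues.
        return ("replicating", "replicating")

    lock = WitnessLock()
    active_won = lock.acquire("active")  # active side tries first, at 10 s
    active = "metro disabled, local writes" if active_won else "inactive"

    if outage_s < standby_timeout_s:
        # Standby has noticed the outage but its own timer has not expired.
        return (active, "waiting for its timeout")

    standby_won = lock.acquire("standby")  # standby tries at 20 s and is refused
    standby = "promoted" if standby_won else "inactive"
    return (active, standby)


for seconds in (5, 15, 25):
    print(seconds, witness_outcome(seconds))
```

Because the active site's timer is shorter, it always reaches the witness first in this scenario, which is why the standby ends up inactive.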
Configure Nutanix Metro Availability
- Create containers with the same name on both clusters
- Add the SiteA cluster as a remote site on the SiteB cluster and vice versa
- In Prism go to –> Data Protection –> +Remote Site –> Physical Cluster. Provide the remote cluster IP address (not a CVM address)
- On the next screen, map the containers you created earlier and save the configuration
- Create the protection domain
- In Prism go to –> Data Protection –> + Protection Domain –> Metro Availability
- Name – provide a meaningful name – my advice is to always include the replication direction and the protection domain type: SiteA-SiteB-MA-01
- Storage container – choose a storage container from the list
- Remote Sites – pick a compatible cluster
- Failure handling – choose one of the three failure handling modes
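If you create many protection domains, a tiny helper keeps the naming convention from the steps above consistent. This is only a sketch of that convention; the function name and its parameters are my own and are not tied to any Nutanix tooling.

```python
def protection_domain_name(source, target, index, kind="MA"):
    """Build a name that encodes replication direction and domain type."""
    # Two-digit, zero-padded index, as in the example SiteA-SiteB-MA-01.
    return f"{source}-{target}-{kind}-{index:02d}"


name = protection_domain_name("SiteA", "SiteB", 1)
print(name)  # SiteA-SiteB-MA-01
```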
Nutanix Metro Availability and a 3rd copy of the data
Virtual workloads protected by Metro Availability can get an additional level of protection by adding a copy on a 3rd site. In this configuration, workloads are protected by Nutanix MA (RPO=0) between SITE A and SITE B and by asynchronous replication (RPO=1h or greater) to SITE C. The advantage of this solution is that you have an “offline” copy of your data on the 3rd site that you can recover from in case of disaster.
NOTE: A disaster scenario is not only a datacenter or cluster level failure. It can also be human error (a badly written, untested script that deletes data from VMs or VMs from the Nutanix cluster), a virus, or massive data corruption on VMs.
You may ask: why do I need a 3rd copy of the data if my workloads are already protected by Nutanix MA? Nutanix MA is synchronous replication, which means any block of data created or deleted on one site is automatically created or deleted on the second site. If someone deletes a VM on SITE A, guess what? Yes, the VM is gone on the other site too. The same applies to a virus, data corruption, or a ransomware attack.
NOTE: A 3rd site is not mandatory to protect data against the cases listed above. Nutanix data protection lets you retain a number of snapshots on both sites that you can recover from. The flip side is capacity planning: each snapshot consumes free space on the cluster. Proper capacity planning is key to keeping your solution fully operational both during normal BAU (Business As Usual) and during a failover event.
How to configure Nutanix to replicate data to a 3rd site
Adding a 3rd data copy to an existing Metro Availability setup is very easy. First, make sure you have a 3rd remote site configured with enough capacity to hold replicated data from one or both metro sites. In Prism Element on the Nutanix cluster on SITEA, go to the Data Protection section –> Metro Availability tab –> Update MA Protection domain –> Add new schedule and choose the cluster on SITEC as the target. Configure the schedule by setting the snapshot interval and the snapshot retention policy. When you finish, the system automatically kicks off replication to the Nutanix cluster on SITEC.
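Before you pick the snapshot interval and retention, it helps to estimate what a schedule means for recovery and for capacity on SITEC. The sketch below is a rough back-of-the-envelope model: it assumes each retained snapshot holds only the data changed since the previous one and it ignores replication transfer time, compression and deduplication, so treat the numbers as a starting point only. The change rate is a hypothetical input you have to measure yourself.

```python
def schedule_summary(interval_hours, retained_snapshots, change_gb_per_hour):
    """Rough estimate of what a snapshot schedule gives you and what it costs."""
    return {
        # Data written since the last snapshot is what you can lose.
        "worst_case_rpo_hours": interval_hours,
        # Age of the oldest restore point just before it expires.
        "oldest_restore_point_hours": interval_hours * retained_snapshots,
        # Assumption: every snapshot keeps one interval worth of changed data.
        "approx_snapshot_space_gb": interval_hours * retained_snapshots * change_gb_per_hour,
    }


# Hypothetical example: hourly snapshots, keep 24, about 5 GB changed per hour.
summary = schedule_summary(1, 24, 5)
for key, value in summary.items():
    print(f"{key}: {value}")
```

Doubling the retention doubles both the recovery window and the estimated space, which is the trade-off to weigh when sizing the 3rd site.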