Monday, November 16, 2015

Recovering a Replicated vSphere Data Protection (VDP) Virtual Appliance

The question: Can a vSphere Data Protection (VDP) virtual appliance be successfully recovered from a replicated copy?

“Successfully recovered” means not only recovering the appliance, but being able to restore a VM from that recovered appliance. To test out this scenario, I used vSphere Replication (VR) to provide a replicated copy of the VDP virtual appliance. I do not recommend using VR to replicate VDP. In this case, I simply needed a way to test whether the VDP appliance could be recovered from the replicated copy at a secondary site – more importantly, to see if I could actually use the recovered VDP appliance at the secondary site to recover a VM that was backed up at the primary site. If you want the quick answer, scroll down to the end of this article for my recommendations. If you want more details...

I deployed a 2TB VDP virtual appliance at the primary site – a total of 13 .vmdk files, thin provisioned. I created a scheduled backup job for one VM. I protected the VDP appliance with VR and set the RPO to 12 hours, which helped me avoid the issues with replicating VDP using VR that I discussed in a previous post. Once the initial replication was complete, I started my test.

To begin, I powered off the original VDP appliance and recovered the replicated copy. A VM is recovered by VR with the virtual network interface card (vNIC) disconnected as a safety precaution. I enabled the vNIC and powered on the recovered VDP appliance. I was not too surprised when I saw the following message.

Also notice in the background that the Core Services are reporting as “Unrecoverable”. This was almost surely a result of the replicated VDP appliance being a crash-consistent copy – basically as if someone “pulled the plug” on the server, moved it to another location and then turned it back on. VDP does not take kindly to that. Always shut down the guest OS of a VDP appliance gracefully (do not power it off).

There was not much I could do with the recovered appliance in that state so I deleted it from disk. However, I was not ready to give up. This time, I let the scheduled backup job and replication run for a few days. This was primarily so the VDP appliance would start running integrity checks and create checkpoints. Having a validated checkpoint would give me the opportunity to perform a Rollback in VDP. The Rollback mechanism is designed to protect against exactly what I ran into with the first attempt – a corrupted appliance due to a crash-consistent recovery.

Once again, I recovered the VDP appliance at my secondary site using VR. The VDP appliance came online in good health – I did not have to perform a Rollback. Looking at the VDP Configure user interface (UI), I found the maintenance services were stopped, but all other services were running. I switched over to the VDP appliance console and noticed it was validating an integrity check.

Once the integrity check was complete, I switched back to the VDP Configure UI and found the Maintenance services were now running – presumably started after the integrity check was completed. Using the VDP Configure UI, I connected it to the vCenter Server and SSO server at the secondary site and rebooted the appliance.

The VDP appliance went through the automated reconfiguration process. After several minutes had passed, VDP was available for use in the vSphere Web Client. Nice! However, when I tried to connect to the appliance, I received an error stating that “the most recent request has been rejected – most likely a time issue”. I posted an article on this issue a few days ago.

After some quick troubleshooting, I found that my vCenter Server virtual appliance, which contains the SSO server, was about 3 minutes behind the VDP appliance and the vSphere host they were running on. I restarted VMware Tools at the command line on the vCenter Server appliance: “service vmware-tools-services restart”. When VMware Tools restarted, the time was updated and all clocks were in sync.
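To make the clock comparison concrete, here is a minimal Python sketch of the check I did by hand. The host names, the timestamps, and the 60-second tolerance are all illustrative assumptions, not values taken from VDP or SSO documentation; in practice you would collect the readings yourself (for example, by running `date` on each system).

```python
from datetime import datetime, timedelta
from typing import Mapping


def max_clock_skew(clocks: Mapping[str, datetime]) -> timedelta:
    """Return the largest difference between any two clock readings."""
    readings = list(clocks.values())
    return max(readings) - min(readings)


def in_sync(clocks: Mapping[str, datetime],
            tolerance: timedelta = timedelta(seconds=60)) -> bool:
    """True if all clocks agree within the tolerance (60 s is an assumed value)."""
    return max_clock_skew(clocks) <= tolerance


# Hypothetical readings that mirror the situation described above:
# the vCenter/SSO appliance is 3 minutes behind the other two systems.
now = datetime(2015, 11, 16, 9, 0, 0)
clocks = {
    "vdp-appliance": now,
    "vsphere-host": now,
    "vcenter-sso": now - timedelta(minutes=3),
}

print("max skew:", max_clock_skew(clocks))
print("in sync:", in_sync(clocks))
```

With a 3-minute gap the check fails, which matches what I saw: the connection was rejected until the clocks agreed.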

As the earlier “Switch to a new vCenter?” warning stated, no backup jobs were present on the recovered VDP appliance, but I did find restore points for the VM backed up at the original site. Naturally, I kept moving by creating a restore job from the most recent restore point for the VM. The job started, but after a few moments, I received the following error:

“VDP: Failed to restore client db-server, Execution error: E10050:Failed to create Virtual Machine.” (db-server being the name of the VM I was attempting to recover)

I speculated that maybe there was some corruption so I went ahead and performed a Rollback in the VDP Configure UI to the validated checkpoint. I tried the restore again and received the same disappointing error message. Turns out this was an issue related to performance. I had several VMs running on this same vSphere host – memory was almost completely consumed and CPU utilization was also at a considerable level. I shut down some of the non-essential VMs, which freed up some memory and CPU, and tried the restore again – it worked!
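In hindsight, a quick headroom check on the host before starting the restore would have saved me a Rollback. The sketch below shows the idea in Python. The `HostUsage` type, the sample figures, and the 85% memory / 75% CPU thresholds are hypothetical; they are not VDP requirements, so pick limits that suit your environment.

```python
from dataclasses import dataclass


@dataclass
class HostUsage:
    """Point-in-time resource usage for one vSphere host (hypothetical type)."""
    mem_used_gb: float
    mem_total_gb: float
    cpu_used_pct: float


def has_headroom(usage: HostUsage,
                 max_mem_pct: float = 85.0,
                 max_cpu_pct: float = 75.0) -> bool:
    """True if memory and CPU usage are both under the assumed thresholds."""
    mem_pct = 100.0 * usage.mem_used_gb / usage.mem_total_gb
    return mem_pct <= max_mem_pct and usage.cpu_used_pct <= max_cpu_pct


# Illustrative numbers only: a host with memory almost fully consumed,
# then the same host after shutting down some non-essential VMs.
before = HostUsage(mem_used_gb=62.0, mem_total_gb=64.0, cpu_used_pct=70.0)
after = HostUsage(mem_used_gb=40.0, mem_total_gb=64.0, cpu_used_pct=35.0)

print("before shutdown:", has_headroom(before))
print("after shutdown:", has_headroom(after))
```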

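So what does all of this mean? A VDP appliance can be recovered from a replicated copy and then used to restore a VM, at least in this one test. As you can see from the details above, I did not do a lot of testing with various replication types, different workload sizes, etc.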

Here are my recommendations:

- Keep it simple. Use the backup data replication engine built into VDP 6.x. It replicates only backup data (not the entire appliance), sends only unique data segments after the backup data has been deduplicated, and encrypts the replicated data on the wire for security. VDP 6.x supports quite a few versions of vCenter Server and vSphere prior to version 6.0; check the VMware Certified Compatibility Guides for details.

- VR is NOT recommended for replicating a VDP appliance. The replicated copy is crash-consistent, which presents a risk of data corruption. There is no way to schedule when VR performs replication, and there is no automated way to replicate a powered-off (i.e. completely quiesced) VDP appliance using vSphere Replication.

- As with most applications, make sure VDP has the CPU and memory resources it needs. If your VDP appliance is particularly busy (backing up many VMs, frequent restore jobs, long-running integrity checks, etc.), it might make sense to configure more memory and/or CPU and memory reservations for the appliance.

