Wednesday, November 12, 2025

Introduction to VMware vSAN Data Protection

VMware vSAN Data Protection, introduced in vSphere 8.0 Update 3, is a powerful capability included in the VMware Cloud Foundation (VCF) license for customers running vSAN ESA. This solution provides local protection for virtual machine workloads at no additional licensing cost. Key capabilities include:

  • Define protection groups and retention schedules.
  • Schedule crash-consistent snapshots of VMs at regular intervals.
  • Create immutable snapshots to help recover from ransomware.
  • Retain up to 200 snapshots per VM.
  • Per-VM replication locally and/or to remote clusters.

The simplicity and tight integration of the tool make it a terrific way to augment existing data protection strategies or serve as a basic but effective solution for protecting virtual machine workloads. The images below show the UI and provide some context. You can click the images to get a better look.
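
One practical planning point is the 200-snapshot-per-VM limit mentioned in the list above. Here is a quick back-of-the-envelope check in Python; every schedule and retention value below is made up for illustration and is not product guidance.

# Quick sizing check for a hypothetical set of protection group schedules
# applied to one VM. All values are assumptions for illustration only.
PER_VM_LIMIT = 200  # vSAN Data Protection retains up to 200 snapshots per VM

schedules = {
    "every 30 minutes, keep 24": 24,  # ~12 hours of half-hourly snapshots
    "hourly, keep 48": 48,            # 2 days of hourly snapshots
    "daily, keep 30": 30,             # a month of daily snapshots
    "weekly, keep 12": 12,            # ~3 months of weekly snapshots
}

total = sum(schedules.values())
print(f"Snapshots retained for this VM: {total} of {PER_VM_LIMIT} allowed")
if total > PER_VM_LIMIT:
    print("Warning: this combination exceeds the per-VM snapshot limit")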

vSAN Data Protection simplifies common recovery tasks, giving administrators the ability to restore one or more VMs directly within the vSphere Client, even if they have been deleted from inventory, a capability not offered by traditional vSphere snapshots. Getting started involves deploying the VMware Live Recovery (VLR) virtual appliance and configuring protection groups, which can also be set to create immutable snapshots for basic ransomware resilience.

For extended functionality, VCF 9.0 introduces a unified virtual appliance and the option for vSAN-to-vSAN replication to other clusters for disaster recovery, which is available with an add-on license. More information about VMware Live Recovery: https://www.vmware.com/products/cloud-infrastructure/live-recovery 

Friday, October 24, 2025

VMware vSAN Daemon Liveness - EPD Status Abnormal

I reinstalled ESXi and vSAN ESA 8.0 on hosts that were previously running ESX and vSAN ESA 9.0. It's a lab environment. I did not bother to delete the previous VMFS volume on a local drive in each host. This is where ESXi was storing scratch data, logs, etc. After reinstalling 8.0 and turning on vSAN, I noticed the following error in vSAN Skyline Health: vSAN daemon liveness.

vSAN daemon liveness

I expanded the health check to get more information. There I could see Overall Health, CLOMD Status, EPD Status, etc. EPD Status showed Abnormal on all of the hosts. I naturally did some searching online and came up with some clues, including the knowledge base (KB) article below.

https://knowledge.broadcom.com/external/article?articleNumber=318410

It was helpful, but did not get me to a resolution. The post below from a few years ago got me closer.

https://community.broadcom.com/vmware-cloud-foundation/discussion/vsan-epd-status-abnormal

I browsed the existing VMFS volume and noticed the epd-storeV2.db file in the .locker folder.

epd-storeV2.db

Since it is a lab environment, I figured why not just delete the file and see if vSAN heals itself by recreating the necessary files. I put the host in maintenance mode as a precaution, deleted the file, and rebooted the host. This resolved the issue. In addition to the recreated .db file, I noticed a new file, epd-storeV2.db-journal.

epd-storeV2.db-journal

I checked vSAN Skyline Health and the error was gone for that host. You can see the status of each host if you click the Troubleshoot button for the vSAN daemon liveness health check. I repeated that effort for each host in the cluster.

The vSAN Skyline Health error was gone after completing the process on all of the hosts in the cluster.

I probably could've restarted services on the hosts as detailed in the KB article above, but I chose to reboot them since they were already in maintenance mode.
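
If you have more than a handful of hosts, the maintenance-mode and reboot portion of this process can be scripted. Below is a rough pyVmomi sketch; the vCenter name and credentials are placeholders, and deleting the epd-storeV2.db file itself still has to happen on each host (over SSH or the datastore browser) while the host is in maintenance mode.

# Rough pyVmomi sketch of the maintenance-mode/reboot portion of the process.
# vCenter name and credentials are placeholders; deleting .locker/epd-storeV2.db
# still happens manually on each host while it is in maintenance mode.
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

ctx = ssl._create_unverified_context()   # lab only; use valid certificates in production
si = SmartConnect(host="vcenter.lab.local", user="administrator@vsphere.local",
                  pwd="********", sslContext=ctx)

content = si.RetrieveContent()
view = content.viewManager.CreateContainerView(content.rootFolder, [vim.HostSystem], True)

for host in view.view:
    print(f"Requesting maintenance mode for {host.name}")
    # Note: on vSAN hosts you may need to pass a maintenanceSpec with the desired
    # vSAN data evacuation mode instead of relying on the defaults.
    host.EnterMaintenanceMode_Task(timeout=0)
    # Wait for the task, delete the epd-storeV2.db file on this host, then:
    # host.RebootHost_Task(force=False)
    # host.ExitMaintenanceMode_Task(timeout=0)

Disconnect(si)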

Monday, October 6, 2025

VMware VCF Installer Warning - Evacuate Offline VMs Upgrade Policy

I ran across this warning in the VMware Cloud Foundation (VCF) Installer when deploying VCF 9:

cluster <cluster name>: Evacuate Offline VMs upgrade policy configured for cluster <cluster name> on vCenter does not match default SDDC Manager ESXi upgrade policy. It has value false in vCenter, and default value true in SDDC Manager.

The fix is simple...

  1. In the vSphere Client, click the three horizontal lines in the top left corner.

  2. Click Lifecycle Manager.

  3. Click Settings.

  4. Click Images.

  5. Click the Edit button for Images on the right side of the window.

  6. Check the box next to "Migrate powered off and suspended VMs to other hosts in the cluster, if a host must enter maintenance mode."

  7. Click Save.

  8. Re-run the VCF Installer validations.
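
If you prefer to check or fix this outside the UI, the same setting is exposed through the vSphere Automation REST API as part of the cluster's vLCM apply policy. The sketch below uses Python and requests; the endpoint path and the evacuate_offline_vms field reflect my reading of the API reference, so verify them against your vCenter build.

# Check and update the cluster's vLCM apply policy over the vSphere Automation
# REST API. The endpoint path and the evacuate_offline_vms field are based on my
# reading of the API reference -- verify both against your vCenter build.
import requests
import urllib3

urllib3.disable_warnings()

VCENTER = "vcenter.lab.local"   # placeholder
CLUSTER = "domain-c8"           # cluster managed object ID (see /api/vcenter/cluster)

s = requests.Session()
s.verify = False                # lab only; use trusted certificates in production

# Create an API session and attach the returned token to subsequent calls
token = s.post(f"https://{VCENTER}/api/session",
               auth=("administrator@vsphere.local", "********")).json()
s.headers["vmware-api-session-id"] = token

# Read the current remediation (apply) policy for the cluster
url = f"https://{VCENTER}/api/esx/settings/clusters/{CLUSTER}/policies/apply"
policy = s.get(url).json()
print("evacuate_offline_vms:", policy.get("evacuate_offline_vms"))

# Setting it to true should align the cluster with the SDDC Manager default
policy["evacuate_offline_vms"] = True
s.put(url, json=policy)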

Tuesday, August 12, 2025

VMware Live Recovery Not Accessible Error

  • VMware Live Recovery 9.0.3.0 and later combine the control plane, replication, and data protection into a single appliance.
  • Ensure DNS forward and reverse lookups are functioning correctly.
  • A "Not accessible" error could indicate an issue with certificates.

I recently had the opportunity to set up the latest VMware Live Recovery release, which consolidates deployment and configuration into a single virtual appliance. More details about this new model can be found in this blog article.

I encountered an issue where the UI showed vSphere Replication and VMware Live Site Recovery, formerly Site Recovery Manager (SRM), as "Not accessible."


I double-checked that DNS was correctly configured, and I was able to ping in all directions between the vCenter instances and the VMware Live Recovery appliances. It was not a name resolution or network connectivity issue. Pro tip: Always verify DNS is configured correctly. I've seen this cause issues in countless scenarios.
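
For what it's worth, here is a quick way to script the forward and reverse lookup check from the machine you are troubleshooting on; the FQDN below is a placeholder.

# Quick forward/reverse DNS sanity check; the FQDN below is a placeholder.
import socket

fqdn = "vlr01.vmware.lab"
ip = socket.gethostbyname(fqdn)          # forward (A record) lookup
name, _, _ = socket.gethostbyaddr(ip)    # reverse (PTR) lookup; raises if no PTR exists
print(f"{fqdn} -> {ip} -> {name}")
if name.lower().rstrip(".") != fqdn.lower():
    print("Reverse lookup does not return the expected FQDN")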

After some searching and trial and error, I determined the issue was a mismatch between the server name and the certificate, specifically the Common Name (CN) in the Issued To field. I logged into the appliance management UI by appending port 5480 to the fully qualified domain name (FQDN) of the server. Using the example FQDN in the screenshot, the URL would be https://vlr01.vmware.lab:5480.
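
If you want to confirm the mismatch before changing anything, you can pull the certificate the appliance is actually serving and compare its CN and Subject Alternative Names to the FQDN. The snippet below is a quick sketch using Python and the cryptography package; the host name is a placeholder.

# Pull the certificate the appliance is serving and compare the CN and SANs to
# the FQDN. The host name below is a placeholder; requires the 'cryptography' package.
import ssl
from cryptography import x509
from cryptography.x509.oid import NameOID

fqdn, port = "vlr01.vmware.lab", 443
pem = ssl.get_server_certificate((fqdn, port))
cert = x509.load_pem_x509_certificate(pem.encode())

cn_attrs = cert.subject.get_attributes_for_oid(NameOID.COMMON_NAME)
cn = cn_attrs[0].value if cn_attrs else "(no CN)"
try:
    sans = cert.extensions.get_extension_for_class(
        x509.SubjectAlternativeName).value.get_values_for_type(x509.DNSName)
except x509.ExtensionNotFound:
    sans = []

print(f"CN:   {cn}")
print(f"SANs: {sans}")
if fqdn.lower() not in {str(cn).lower(), *(s.lower() for s in sans)}:
    print("Mismatch: certificate does not match the FQDN, the likely cause of 'Not accessible'")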

After logging in with the "admin" username and the password I specified for that account during deployment, I selected Certificates in the left menu bar. Then, I clicked Change near the top right corner to generate a new self-signed certificate.


I completed the fields in the Change certificate UI and clicked the Change button to generate the new certificate. I verified the CN matched the FQDN and rebooted the VMware Live Recovery virtual appliance just for good measure. When the server came back up, the error was gone, and I had the green checkmark OK indicator.



Thursday, July 17, 2025

Technical Introduction to vSAN ESA 2-Node Clusters

  • VMware vSAN can be deployed in a 2-node cluster.
  • vSAN ESA 2-node clusters provide excellent performance and availability in a small form factor.
  • A vSAN Witness Host virtual appliance enables protection against "split-brain."

Introduction

VMware's vSAN Express Storage Architecture (ESA) has streamlined hyperconverged infrastructure (HCI) by optimizing performance and efficiency. A particularly compelling deployment model within this architecture is the 2-node cluster. This setup offers a high-availability solution ideal for small sites and edge computing environments where space and hardware resources are limited.

Core Architecture and Requirements

A vSAN ESA 2-node cluster is a specialized configuration that provides data redundancy and high availability with just two physical hosts at the primary site. Unlike larger clusters, it doesn't require a minimum of three nodes. Instead, it relies on a virtual machine appliance called a vSAN Witness Host located on a host other than the two physical nodes.

VMware vSAN 2-node Cluster

Key Components

  1. Two Physical Nodes: These are two physical ESXi hosts that run virtual machines and store the data associated with those virtual machines. In ESA, these nodes must be all-NVMe, utilizing high-performance, enterprise-class NVMe storage devices. There are no traditional vSAN disk groups or dedicated cache devices; instead, each host contributes its devices to a single, flexible storage pool.

  2. Witness Host: The witness is a virtual machine appliance that resides on a third host, typically at a primary data center, when deploying a 2-node cluster in a remote site. Its primary role is to act as a tiebreaker in the event of a failure or network partition between the two data nodes. The witness host stores only metadata in the form of witness components. It does not run virtual machines or store virtual machine data (VMDK files).

Networking Essentials

Proper networking is critical for a stable 2-node cluster.

  • Data Node Interconnect: The two data nodes need a high-speed, low-latency network connection, usually a direct link or a dedicated switch. At least 10 Gbps is required to support vSAN traffic and vMotion effectively; 25 Gbps or higher is recommended.
  • vSAN Witness Host Traffic: The connection between the physical hosts and the vSAN Witness Host has specific requirements. While the latency requirements are less strict than for stretched clusters, they are still important.

    • Latency: The round-trip time (RTT) to the witness should be less than 500 ms.

    • Bandwidth: A 2 Mbps connection is generally sufficient, as only metadata is transmitted (see the rough sizing sketch after this list).

    • Traffic: It's best practice to tag the witness traffic on a separate VMkernel adapter to keep it isolated from other network traffic.
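
To put the bandwidth bullet in perspective, the guidance VMware has historically published for witness traffic is on the order of 2 Mbps per 1,000 vSAN components. A quick back-of-the-envelope check, using a made-up component count for a small 2-node cluster:

# Back-of-the-envelope witness bandwidth estimate using the commonly cited
# guideline of roughly 2 Mbps per 1,000 vSAN components. The component count
# is a made-up example; check the sizing guidance for your release before
# committing to a link size.
components = 400                          # hypothetical witness component count
required_mbps = 2 * (components / 1000)   # ~2 Mbps per 1,000 components
print(f"Estimated witness bandwidth: {required_mbps:.2f} Mbps")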

Data Placement and Protection

In a vSAN ESA 2-node cluster, data protection is achieved through mirroring. The default storage policy for a 2-node cluster is RAID 1 (Mirroring).

Here's how it works:

  1. A virtual machine's disk object (VMDK) is created on the vSAN datastore.

  2. The object has two complete copies, or replicas.

  3. One replica is placed on Node A, and the second replica is placed on Node B.

  4. A small witness component is created and placed on the witness host.

This creates a data layout of Replica 1, Replica 2, and Witness. This configuration ensures that if one data node fails, a complete copy of the data is still available on the surviving node. The witness component ensures that a quorum (a majority of component votes) remains available to keep the VM object online.

For example, if Node A fails, Node B still has a complete data replica, and the witness host provides the third component vote. The cluster recognizes that a valid, complete copy of the data exists and keeps the VM running (after a vSphere HA restart) on Node B.
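
To make the voting behavior concrete, here is a deliberately simplified model of the quorum logic. This is an illustration only, not how vSAN is implemented internally: an object stays accessible as long as the reachable components hold a majority of the votes.

# A deliberately simplified model of the quorum logic, for illustration only;
# this is not how vSAN is implemented internally.

def object_accessible(reachable_votes: int, total_votes: int) -> bool:
    """An object stays online while reachable components hold a majority of votes."""
    return 2 * reachable_votes > total_votes

# RAID-1 layout in a 2-node cluster: one vote each (in this toy model) for the
# replica on Node A, the replica on Node B, and the witness component.
votes = {"node_a_replica": 1, "node_b_replica": 1, "witness": 1}
total = sum(votes.values())

# Node A fails: Node B's replica plus the witness still hold 2 of 3 votes.
print(object_accessible(votes["node_b_replica"] + votes["witness"], total))  # True

# Node A fails and the witness is unreachable: only 1 of 3 votes remains.
print(object_accessible(votes["node_b_replica"], total))  # False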

Handling Failures: Split-Brain Scenario

The primary function of the witness is to prevent a "split-brain" scenario. Imagine the direct network link between Node A and Node B fails, but both nodes can still communicate with the witness host.

  • Without a witness, both nodes would think the other is offline and would attempt to take sole ownership of the virtual machines, leading to data corruption.

  • With a witness, both Node A and Node B will attempt to place a lock on the witness components for their VMs. The node that successfully acquires the lock gains ownership of the VM objects and continues writing to its replica, preventing a split-brain condition and ensuring data integrity.

Summary

vSAN ESA 2-node clusters offer a robust and efficient high-availability solution for small environments and edge use cases such as remote offices. By combining NVMe-based hardware with a simple yet effective witness architecture, they deliver enterprise-grade speed and resilience in a compact form factor.