Friday, September 30, 2016

Virtual SAN Availability Part 6 - Maintenance Mode

Planned Downtime


The last few articles in this series focused on unplanned downtime. While we still have more to cover there, let's briefly shift focus to planned downtime. The primary example of planned downtime is host maintenance. There are a number of reasons a vSphere host might need to be taken offline such as firmware updates, storage device replacement, and software patches.

vSphere has a feature designed specifically for these types of activities. It is called "maintenance mode". When a host is put into maintenance mode, vSphere automatically evacuates the running virtual machines to other hosts in the cluster. This is done with vMotion so that virtual machine downtime is not incurred. This can take a few minutes or more depending on factors such as the number of virtual machines that must be migrated and vMotion network speed. Once all of the virtual machines have been evacuated, the host enters maintenance mode and work on that host can begin.


Virtual SAN introduces another consideration, which is the utilization of local storage devices inside of each host. These devices contain components that make up Virtual SAN objects. Shutting down or rebooting a host naturally makes these components inaccessible until the host is back online. Let's take a closer look at how vSphere's maintenance mode has been enhanced for Virtual SAN clusters.

Maintenance Mode with Virtual SAN

When a host that is part of a Virtual SAN cluster is put into maintenance mode, the administrator is given three options concerning the data (Virtual SAN components) on the local storage devices of that host. The option selected affects two factors: the level of availability maintained for the objects with components on the host, and the amount of time it will take for the host to enter maintenance mode. The options are:
  • Ensure accessibility (default)
  • Full data migration
  • No data migration
To help illustrate the difference between these maintenance mode options, we will use diagrams that provide examples of component placement. The first diagram below shows a four-node cluster and only two objects to keep things simple. One object is assigned a storage policy with FTT=1. The three components - two data, one witness - that make up this object are colored green. The second object is assigned a storage policy with FTT=0. Because of the FTT=0 assignment, it has only one component, which is colored dark orange in the diagrams.



Ensure Accessibility

Ensure accessibility instructs Virtual SAN to migrate just enough data to ensure every object is accessible after the host goes into maintenance mode. The level of availability protection might be reduced for some objects. This next diagram shows component placement after putting the first host into maintenance mode with the ensure accessibility option.



The orange component is migrated to another host to ensure access. Since there is already a green data component accessible on the fourth host and the witness for that object is on the second host, the object is accessible even though the green component on the first host was not migrated. The green component on the first host is offline and marked as Absent in the Virtual SAN UI.

When the first host is back online, the green component on that host will be updated with any changes made to the green component on the fourth host. In other words, the two green data components will be synchronized to bring the object back into compliance with the FTT=1 rule.

The ensure accessibility option is appropriate for scenarios where a host will be offline for a short period of time, e.g., a quick patch and host reboot. This option minimizes the amount of data that must be moved while maintaining access to every object. Keep in mind the level of availability might be reduced for objects that have components on the host in maintenance mode. For example, in the scenario above, the object composed of the green components would become inaccessible if the fourth host went offline while the first host was in maintenance mode.
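The accessibility reasoning above can be sketched in a few lines of Python. This is a simplified model, not Virtual SAN's actual logic: it assumes each component carries one vote and that an object stays accessible while more than half of its components sit on hosts that are online. The host names, the `is_accessible` helper, and the destination of the migrated orange component are all hypothetical; the green placement mirrors the diagram (data on the first and fourth hosts, witness on the second).

```python
# Simplified model (an assumption, not Virtual SAN's real voting algorithm):
# one vote per component, and a strict majority must be online.

def is_accessible(component_hosts, offline_hosts):
    """Return True if more than half of an object's components are online."""
    online = [h for h in component_hosts if h not in offline_hosts]
    return len(online) * 2 > len(component_hosts)

# Hypothetical placement mirroring the diagram.
green = ["host1", "host4", "host2"]  # data, data, witness (FTT=1)
orange_before = ["host1"]            # single component (FTT=0)
orange_after = ["host3"]             # after migration; destination is assumed

in_maintenance = {"host1"}

print(is_accessible(green, in_maintenance))          # True
print(is_accessible(orange_before, in_maintenance))  # False
print(is_accessible(orange_after, in_maintenance))   # True
print(is_accessible(green, {"host1", "host4"}))      # False
```

Under this model, the orange component is the only one that has to move for every object to stay accessible, and the last line shows the reduced protection: losing the fourth host while the first is in maintenance mode takes the green object offline.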

Keep in mind the rebuild timer (60 minutes by default) is in effect for all maintenance mode options. If a host stays in maintenance mode longer than the rebuild timer value, Virtual SAN will start rebuilding the absent components located on that host. Recommendation: If maintenance is going to take longer than the rebuild timer value, select the full data migration option (discussed next).
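The recommendation above can be written down as a small decision helper. This is only a sketch of the rule of thumb in this article, not a vSphere or Virtual SAN API; the function name `recommend_option` and its parameters are hypothetical.

```python
REBUILD_TIMER_MINUTES = 60  # default value mentioned in this article

def recommend_option(expected_downtime_minutes,
                     rebuild_timer_minutes=REBUILD_TIMER_MINUTES):
    """Suggest a maintenance mode option from the expected downtime."""
    if expected_downtime_minutes > rebuild_timer_minutes:
        # The host would outlast the timer and trigger rebuilds anyway.
        return "Full data migration"
    return "Ensure accessibility"

print(recommend_option(20))   # Ensure accessibility
print(recommend_option(180))  # Full data migration
```

The helper looks only at duration. Other considerations in this article, such as whether reduced availability is acceptable during the maintenance window, still need a human decision.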

Full Data Migration

As you might expect from the name of this option, all data is migrated from the host going into maintenance mode. This option is best for cases where a host is going to be offline for a longer period of time or permanently decommissioned. It is also appropriate in cases where the number of failures to tolerate for objects must not be reduced.


This option maintains compliance with the number of failures to tolerate, but requires more time because all data is migrated from the host going into maintenance mode. It usually takes longer for a host to enter maintenance mode with full data migration than with the ensure accessibility option.

Also note that with smaller clusters, it might not be possible for Virtual SAN to maintain compliance with an FTT rule. Consider a cluster with three hosts. If one of the hosts is put into maintenance mode, there would only be two hosts providing access to components. That is one less than the minimum number of hosts needed for the object with green components to be compliant with the FTT=1 rule. The object would be accessible with two of the three hosts online. The object would become inaccessible if a host failed while the other is in maintenance mode (only one of the three hosts online). Recommendation: Build a cluster with at least four hosts if the cluster will be running workloads that must be highly available at all times (including maintenance windows).
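The host-count arithmetic in this example can be checked with a short sketch. It assumes mirrored objects like the green one in this article, where FTT=1 needs two data components and a witness on three separate hosts; the general 2 × FTT + 1 rule and both function names are stated here as assumptions for illustration.

```python
def min_hosts_for_ftt(ftt):
    """Minimum hosts for a mirrored object (assumed rule: 2 * FTT + 1)."""
    return 2 * ftt + 1

def stays_compliant(cluster_hosts, hosts_in_maintenance, ftt):
    """True if enough hosts remain online to satisfy the FTT rule."""
    remaining = cluster_hosts - hosts_in_maintenance
    return remaining >= min_hosts_for_ftt(ftt)

print(stays_compliant(3, 1, 1))  # False: two hosts left, three needed
print(stays_compliant(4, 1, 1))  # True: three hosts left
```

This matches the recommendation: a fourth host is what allows an FTT=1 object to remain compliant while one host is in maintenance mode.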

No Data Migration

No data is migrated when this option is selected. A host will typically enter maintenance mode quickly with this option, but there is risk if any of the objects have a storage policy assigned with FTT=0. As seen in the diagram below, the object with green components will remain accessible, but the object with the orange component will be offline.


This option is best for short amounts of planned downtime where all objects are assigned a policy with FTT=1 or higher or downtime of objects with FTT=0 is acceptable.

In Part 7 of this series, we look at how Virtual SAN handles a storage device that is in a degraded state.

@jhuntervmware

2 comments:

  1. Hi Jeff,

    Does the 60 minutes rebuild timer come into play with maintenance mode (ensure accessibility/no data migration) ?

    Regards,

    1. Yes, the 60-minute rebuild timer is still in effect. If a host is in maintenance mode for longer than 60 minutes, VSAN will start the rebuild process. The recommendation is to select the full data migration option when a host will be in maintenance mode for a longer period of time. Thank you for asking. I will add this information to the blog article.
