Monday, October 10, 2016

Virtual SAN Availability Part 7 - Degraded Disk Handling (DDH)


While this blog series focuses on availability, performance is certainly worth mentioning. In many cases, a poorly performing application or platform can be the equivalent of being offline. For example, excessive latency (network, disk, etc.) can cause a database query to take much longer than normal. If an end-user expects query results in 30 seconds and it suddenly takes 10 minutes, it is likely the end-user will stop using the application and report the issue to IT - the same result as the database being offline altogether.

A cache or capacity device that is constantly producing errors and/or high latencies can have a similar negative effect on a Virtual SAN (VSAN) cluster, and the impact can extend to multiple workloads in the cluster. Prior to VSAN 6.1, a badly behaving disk caused issues in a handful of cases, which led to the development of another VSAN availability feature. It is commonly called Dying Disk Handling, Degraded Disk Handling, or simply "DDH".

Virtual SAN (VSAN) 6.1 and newer versions monitor cache and capacity devices for issues such as excessive latency and errors. These symptoms can be indicative of an imminent drive failure. Monitoring these conditions enables VSAN to proactively correct conditions such as excessive latencies, which negatively affect performance and availability. Depending on the version of VSAN you are running, you might see varying responses to disks that are behaving badly.

DDH in VSAN 6.1


Version 6.1 simply looks for a sustained period of high read and/or write latencies (greater than 50ms). If the condition exists for longer than 10 minutes, VSAN will issue an alarm and unmount the disk. If the disk with excessive latency is a cache device and VSAN takes it offline, the entire disk group is unmounted. In other words, all capacity devices in the disk group are also unmounted. As you can imagine, this impacts a larger number of components in many cases. Here is an example of this happening with an SSD:

2015-09-15T02:21:27.270Z cpu8:89341)VSAN Device Monitor: WARNING – READ Average Latency on VSAN device naa.6842b2b006600b001a6b7e5a0582e09a has exceeded threshold value 50 ms 1 times.
2015-09-15T02:21:27.570Z cpu5:89352)VSAN Device Monitor: Unmounting VSAN diskgroup naa.6842b2b006600b001a6b7e5a0582e09a
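
To make the 6.1 behavior concrete, here is a minimal Python sketch of the trigger logic described above. This is a conceptual model only - the function name and the one-minute sampling interval are hypothetical assumptions, not VSAN internals; just the 50 ms threshold and the 10-minute window come from the documented behavior:

```python
# Conceptual sketch of VSAN 6.1 DDH trigger logic (NOT actual VSAN code).
# Known values from the behavior described above:
LATENCY_THRESHOLD_MS = 50   # read or write average latency threshold
SUSTAINED_WINDOW_MIN = 10   # latency must stay high for this long

def should_unmount(latency_samples_ms, sample_interval_min=1):
    """Return True if the most recent 10 minutes of samples all exceed 50 ms.

    latency_samples_ms: average device latency per sampling interval,
    oldest first. The one-minute sampling interval is an assumption.
    """
    window = SUSTAINED_WINDOW_MIN // sample_interval_min
    if len(latency_samples_ms) < window:
        return False
    return all(s > LATENCY_THRESHOLD_MS for s in latency_samples_ms[-window:])
```

Note that a single spike does not trigger the unmount; any dip below the threshold inside the window resets the condition.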

Components on this disk group will be marked as "Absent" and the rebuild of these components on disks in other disk groups will begin after the 60-minute rebuild timer has expired (see Part 4 of this series for more information on component states and the rebuild timer).


If an object is not protected by either RAID-1 mirroring or RAID-5/6 erasure coding and it has a component on the unmounted disk group, that object will be inaccessible. This scenario highlights another reason why FTT should be configured greater than zero in nearly every case and a proper data protection strategy is a must. We will get into backup, replication, and disaster recovery later in this series.

Taking a disk or an entire disk group offline can be somewhat disruptive and sometimes requires rebuilding data. This is something all storage platforms avoid unless absolutely necessary. With VSAN 6.1, the criteria - errors and/or high latency for 10 minutes - were not as selective as they should have been. There are a few cases where the issue is transient. A disk might produce high latencies for 15 minutes and then return to normal performance levels. We want to avoid initiating the movement of large amounts of data in cases like this, which prompted some changes in VSAN 6.2.

DDH in VSAN 6.2


VSAN 6.2 includes four enhancements to improve the reliability and effectiveness of DDH:

1. DDH will not unmount a VSAN caching or capacity disk due to excessive read IO latency. Only write IO latency will trigger an unmount. Taking a disk offline and evacuating all of the data from that disk is usually far more disruptive than a sustained period of read IO latency. This change was made to reduce the occurrence of "false positives" where read latency rises beyond the trigger threshold temporarily and returns to normal.

2. By default, DDH will not unmount a caching tier device due to excessive write IO latency. As discussed above, taking a cache device offline causes the unmount of the cache device and all capacity devices in the disk group. In nearly all cases, excessive write IO latency at the cache tier will be less disruptive than taking an entire disk group offline. DDH will only unmount a VSAN disk with excessive write IO latency if the device is serving as a capacity device. This global setting (it affects all VSAN disks) can be overridden with the following ESXi command:

esxcfg-advcfg --set 1 /LSOM/lsomSlowTier1DeviceUnmount

Running the command above will instruct VSAN to unmount a caching tier device with excessive write IO latency.

3. DDH tracks excessive latency over multiple, randomly selected 10-minute intervals instead of using a single 10-minute interval. This improves the accuracy and reliability of DDH and reduces the occurrence of false positives. Transient elevations in IO from activities such as VSAN component recovery, sector remapping for HDDs, and garbage collection for SSDs should no longer cause DDH false positives. To further improve accuracy, latency must exceed the threshold in four non-consecutive 10-minute intervals that are randomly spread over a six- to seven-hour period.

4. DDH attempts to re-mount VSAN disks in a failed state, or disks previously unmounted by DDH. DDH will attempt to re-mount a disk under these conditions approximately 24 times over a 24-hour period. The re-mount attempt will fail if the condition that caused the disk to go offline is still present.
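
The interval sampling in item 3 can be modeled roughly as follows. This is an illustrative Python sketch, not VSAN code - the slot-based bookkeeping, function names, and number of candidate slots are assumptions; only the four intervals, the 10-minute length, and the six- to seven-hour spread come from the description above:

```python
import random

# Toy model of VSAN 6.2 DDH interval tracking (NOT actual VSAN code).
THRESHOLD_MS = 50        # latency threshold, as in 6.1
INTERVALS_REQUIRED = 4   # four non-consecutive 10-minute intervals
OBSERVATION_HOURS = 7    # intervals spread over roughly six to seven hours

def pick_intervals(rng=random):
    """Randomly choose 4 of the 10-minute slots in the observation window."""
    total_slots = OBSERVATION_HOURS * 6   # six 10-minute slots per hour
    return sorted(rng.sample(range(total_slots), INTERVALS_REQUIRED))

def should_unmount(avg_latency_by_slot, slots):
    """Unmount only if latency exceeded the threshold in every sampled slot."""
    return all(avg_latency_by_slot[s] > THRESHOLD_MS for s in slots)
```

The practical effect: a transient burst that covers one or two slots no longer trips DDH; a device has to misbehave across samples spread over several hours.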

DDH Indicators


There are a few things to look for to figure out if DDH has kicked in:

A vmkernel.log log message indicating that a disk or disk group has been unmounted...

2016-10-10T10:10:51.481Z cpu6:43298)WARNING: LSOM: LSOMEventNotify:6440: Virtual SAN device 52db4996-ffdd-9957-485c-e2dcf1057f66 is under permanent error.
.
.
.
2016-10-10T10:17:53.238Z cpu14:3443764)VSAN Device Monitor: Successfully unmounted failed VSAN diskgroup naa.600508b1001cbbbe903bd48c8f6b2ddb

A reference to "failed" (above) as opposed to "unhealthy" in the vmkernel.log message indicates that a disk failure was detected as opposed to a disk with sustained high latency.
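
The "failed" versus "unhealthy" wording is easy to check for programmatically. Here is a hypothetical Python helper based on the message text shown above - the exact "unhealthy" variant of the message is an assumption inferred from this description, so verify against your own logs before relying on it:

```python
# Hypothetical classifier for DDH unmount messages in vmkernel.log
# (illustrative sketch, not a supported VSAN tool).
def classify_unmount(line):
    """Return 'failure' for a detected disk failure, 'latency' for a
    sustained-high-latency unmount, or None for unrelated lines."""
    if "unmounted failed VSAN diskgroup" in line:
        return "failure"      # a hard disk failure was detected
    if "unmounted unhealthy VSAN diskgroup" in line:
        return "latency"      # sustained high latency triggered DDH
    return None
```

Feeding each line of vmkernel.log through a helper like this makes it easy to separate genuine device failures from latency-driven unmounts.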

A hostd log message indicating that a disk or disk group has been unmounted...

event.Unmounting failed VSAN diskgroup

A vCenter Server event message indicating that a disk or disk group re-mount was not successful...

eventTypeId = "Re-mounting failed VSAN diskgroup  naa.600508b1001cbbbe903bd48c8f6b2ddb.",

One last thing to note: when deduplication and compression are enabled, a DDH issue with any disk in the disk group (cache or capacity) will cause an unmount of the entire disk group.

Up Next...

We will change things up in the next few articles and cover Virtual SAN stretched clusters - Part 8 provides an introduction.

@jhuntervmware
