Monday, October 10, 2016

Virtual SAN Availability Part 7 - Degraded Disk Handling (DDH)

Degraded Disk Handling (DDH)

While this blog series focuses on availability, performance is certainly worth mentioning. In many cases, a poorly performing application or platform can be the equivalent of one that is offline. For example, excessive latency (network, disk, etc.) can cause a database query to take much longer than normal. If an end-user expects query results in 30 seconds and suddenly it takes 10 minutes, it is likely the end-user will stop using the application and report the issue to IT, which is the same result as the database being offline altogether.

A cache or capacity device that is constantly producing errors and/or high latencies can have a similar negative effect on a Virtual SAN (VSAN) cluster. This can impact multiple workloads in the cluster. Prior to VSAN 6.1, a badly behaving disk caused issues in a handful of cases, which led to another VSAN availability feature. It is commonly called Dying Disk Handling, Degraded Disk Handling, or simply "DDH".

Virtual SAN (VSAN) 6.1 and newer versions monitor cache and capacity devices for issues such as excessive latency and errors. These symptoms can be indicative of an imminent drive failure. Monitoring these conditions enables VSAN to be proactive in correcting conditions such as excessive latencies, which negatively affect performance and availability. Depending on the version of VSAN you are running, you might see varying responses to disks that are behaving badly.
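To make the idea of monitoring a device for excessive latency and errors a little more concrete, here is a minimal Python sketch. It is not how VSAN implements DDH. The `DeviceSample` type, the `is_degraded` function, and every threshold value are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class DeviceSample:
    """One hypothetical monitoring sample for a cache or capacity device."""
    latency_ms: float
    io_errors: int


def is_degraded(
    samples: Sequence[DeviceSample],
    latency_threshold_ms: float = 50.0,  # assumed value, not a VSAN default
    min_slow_samples: int = 3,           # assumed value, not a VSAN default
    max_errors: int = 0,                 # assumed value, not a VSAN default
) -> bool:
    """Flag a device when it is repeatedly slow or reports I/O errors."""
    slow = sum(1 for s in samples if s.latency_ms > latency_threshold_ms)
    errors = sum(s.io_errors for s in samples)
    return slow >= min_slow_samples or errors > max_errors


healthy = [DeviceSample(2.0, 0) for _ in range(5)]
slow_disk = [DeviceSample(80.0, 0) for _ in range(3)] + [DeviceSample(2.0, 0)]
print(is_degraded(healthy), is_degraded(slow_disk))
```

Requiring several slow samples rather than one is a design choice in this sketch so that a single latency spike does not flag an otherwise healthy device.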

Friday, September 30, 2016

Virtual SAN Availability Part 6 - Maintenance Mode

Planned Downtime

The last few articles in this series focused on unplanned downtime. While we still have more to cover there, let's briefly shift focus to planned downtime. The primary example of planned downtime is host maintenance. There are a number of reasons a vSphere host might need to be taken offline such as firmware updates, storage device replacement, and software patches.

vSphere has a feature designed specifically for these types of activities. It is called "maintenance mode". When a host is put into maintenance mode, vSphere automatically evacuates the running virtual machines to other hosts in the cluster. This is done with vMotion so that virtual machine downtime is not incurred. This can take anywhere from a few minutes to considerably longer depending on factors such as the number of virtual machines that must be migrated and vMotion network speed. Once all of the virtual machines have been evacuated, the host enters maintenance mode and work on that host can begin.

Virtual SAN introduces another consideration, which is the utilization of local storage devices inside of each host. These devices contain components that make up Virtual SAN objects. Shutting down or rebooting a host naturally makes these components inaccessible until the host is back online. Let's take a closer look at how vSphere's maintenance mode has been enhanced for Virtual SAN clusters.

Thursday, September 22, 2016

Virtual SAN Availability Part 5 - Fault Domains

Fault Domains

"Fault domain" is a term that comes up fairly often in availability discussions. In IT, a fault domain usually refers to a group of servers, storage, and/or networking components that would be impacted collectively by an outage. A very common example of this is a server rack. If a top-of-rack (TOR) switch or the power distribution unit (PDU) for a server rack were to fail, it would take all of the servers in that rack offline even though the servers themselves are functioning properly. That server rack is considered a fault domain.
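The rack example can be sketched in a few lines of Python. This is only an illustration of the general concept, not of how VSAN models fault domains. The host names, rack names, and both function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def hosts_by_fault_domain(host_to_rack: Dict[str, str]) -> Dict[str, List[str]]:
    """Group hosts by the rack (fault domain) they live in."""
    domains: Dict[str, List[str]] = defaultdict(list)
    for host, rack in host_to_rack.items():
        domains[rack].append(host)
    return {rack: sorted(hosts) for rack, hosts in domains.items()}


def hosts_lost_if_rack_fails(host_to_rack: Dict[str, str], failed_rack: str) -> List[str]:
    """Hosts taken offline when a rack's TOR switch or PDU fails."""
    return sorted(h for h, r in host_to_rack.items() if r == failed_rack)


# Hypothetical inventory: four hosts spread across two racks.
inventory = {
    "esx-01": "rack-a",
    "esx-02": "rack-a",
    "esx-03": "rack-b",
    "esx-04": "rack-b",
}
print(hosts_by_fault_domain(inventory))
print(hosts_lost_if_rack_fails(inventory, "rack-a"))
```

Even though every host in `rack-a` is healthy, a single rack-level failure removes all of them at once, which is why the rack is treated as one fault domain.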

Virtual SAN (VSAN) includes a feature called "Rack Awareness", which enables an administrator to configure fault domains in the context of a Virtual SAN cluster. Before we get into the details of this feature, let's briefly revisit the default behavior of VSAN.

Monday, September 19, 2016

Virtual SAN Availability Part 4 - Component States

VSAN Component States

VSAN components can be found in a few different states. The most common state is Active, which means the component is accessible and is up to date. Below we see two components that are Active.

Another fairly common component state is Reconfiguring. This state is observed when a change to a storage policy is made or a new storage policy is assigned to an object. For example, it occurs when the Failure Tolerance Method is changed from RAID-1 mirroring to RAID-5/6 erasure coding on an all-flash VSAN cluster. The screenshot below shows a component in the Reconfiguring state.

There are other component states related to availability that are observed when a disk or host is offline. Let's take a closer look at these states.

Wednesday, September 14, 2016

Virtual SAN Availability Part 3 - Network Partitions

VSAN Utilizes the Network

VSAN consists of two or more physical hosts typically connected by 10GbE networking. 1GbE is supported with hybrid VSAN configurations, but 10GbE is recommended. 10GbE is required for all-flash VSAN clusters. These network connections are required to replicate data across hosts for redundancy and to share metadata updates such as the location of an object's components.
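The link speed guidance above can be written as a small check. This is a rough Python sketch of the rules as stated in this article only. The function name `check_vsan_network` and the returned strings are made up for illustration and are not part of any VMware tool.

```python
def check_vsan_network(cluster_type: str, link_speed_gbe: float) -> str:
    """Classify a link speed against the guidance stated in the article.

    cluster_type: "hybrid" or "all-flash"
    link_speed_gbe: link speed in gigabits per second
    """
    kind = cluster_type.strip().lower()
    if kind not in ("hybrid", "all-flash"):
        raise ValueError(f"unknown cluster type: {cluster_type!r}")
    if link_speed_gbe >= 10:
        return "ok"
    if kind == "hybrid" and link_speed_gbe >= 1:
        # 1GbE is supported for hybrid, but 10GbE is recommended.
        return "supported, 10GbE recommended"
    # All-flash below 10GbE, or anything below 1GbE.
    return "not supported"


for kind, speed in [("hybrid", 1), ("hybrid", 10), ("all-flash", 1), ("all-flash", 10)]:
    print(kind, speed, check_vsan_network(kind, speed))
```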

As with any storage fabric, redundant connections are strongly recommended. The VMware Virtual SAN 6.2 Network Design Guide provides more details on network configurations and recommendations. VSAN's dependence on the network often brings up questions about what happens if one or more hosts lose network connectivity with other hosts in the cluster. This article aims to address those questions.

Friday, September 9, 2016

Virtual SAN Availability Part 2 - Storage Policies and Component Placement

Storage Policies Affect the Number of Components

In the first part of this blog series, we started with the basics of Virtual SAN (VSAN) architecture and how data is stored on a VSAN datastore. As discussed in that post, objects such as virtual disks (VMDKs) are stored as one or more components. The maximum size of a component is 255GB. If an object is larger than 255GB, it is split up into multiple components.
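The 255GB limit translates into simple arithmetic. The Python sketch below computes the minimum number of components a single copy of an object needs based on size alone. The function name `min_components` is made up for illustration, and the calculation deliberately ignores everything other than the size limit, such as the availability rules that further affect component counts.

```python
import math

MAX_COMPONENT_GB = 255  # maximum component size stated in the article


def min_components(object_size_gb: float) -> int:
    """Minimum components for one copy of an object, from size alone."""
    if object_size_gb <= 0:
        raise ValueError("object size must be positive")
    return math.ceil(object_size_gb / MAX_COMPONENT_GB)


for size_gb in (100, 255, 256, 600):
    print(f"{size_gb}GB object -> {min_components(size_gb)} component(s)")
```

For example, a 600GB virtual disk does not fit in two 255GB components (510GB), so it needs at least three.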

Another factor that affects the number of components that make up an object is the level of availability. This is determined by the availability rule(s) configured in a storage policy, which is assigned to an object. These rules and how they affect component counts are the topic of this article.

Wednesday, September 7, 2016

Virtual SAN Availability Part 1 - Intro and Basics


VMworld 2016 U.S. featured many popular breakout sessions covering VMware storage and availability products including Virtual SAN, Site Recovery Manager, and Virtual Volumes. One of these sessions is STO8179 - Understanding the Availability Features of Virtual SAN, which was delivered by GS Khalsa and me. Most of the VMworld sessions are available for playback online, but I thought it made sense to create a blog series on this topic considering the popularity of the session. Only a finite amount of content can be delivered within the 60-minute time frame of a VMworld breakout session. A series of blog articles provides another way to consume the information and allows for the addition of supplemental content. This article is the first of the series. As stated in the video recording of the VMworld session, this discussion assumes you have some basic knowledge of Virtual SAN or "VSAN" as it is often called. If you need a primer, start with this VMware Virtual SAN 6.2 Data Sheet.