Friday, September 9, 2016

Virtual SAN Availability Part 2 - Storage Policies and Component Placement

Storage Policies Affect the Number of Components

In the first part of this blog series, we started with the basics of Virtual SAN (VSAN) architecture and how data is stored on a VSAN datastore. As discussed in that post, objects such as virtual disks (VMDKs) are stored as one or more components. The maximum size of a component is 255GB. If an object is larger than 255GB, it is split up into multiple components.
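The split described above is simple ceiling division. Here is a minimal sketch (the function name `min_components` is ours, for illustration only) showing how many components an object of a given size needs before replication or striping is considered:

```python
import math

COMPONENT_MAX_GB = 255  # maximum size of a single VSAN component

def min_components(object_size_gb: float) -> int:
    """Minimum number of components needed to store an object of the
    given size, before any replication or striping is applied."""
    return max(1, math.ceil(object_size_gb / COMPONENT_MAX_GB))

# A 620GB VMDK is split into ceil(620 / 255) = 3 components.
print(min_components(620))  # 3
```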

Another factor that affects the number of components that make up an object is the level of availability. This is determined by the availability rule(s) configured in the storage policy assigned to the object. These rules and how they affect component counts are the topic of this article.

Storage Policy Based Management (SPBM)

Let's start with some basics of storage policies. Storage Policy Based Management (SPBM) is the management framework used to govern the storage services applied to objects on VSAN and Virtual Volumes (VVols). Plenty of other sources provide a closer look at SPBM, such as this blog article. To summarize, a storage policy consists of one or more rules that determine items such as the level of availability, stripe width, IOPS limit, and so on. Here is a screenshot of a storage policy with two rules configured.

We will explain the two rules shown in this storage policy in a moment. A storage policy is applied to a VM or an individual VMDK. Policies can be applied to objects and modified dynamically with no downtime. This makes it very easy to change the type of availability protection, for example, from mirroring to erasure coding without having to migrate (Storage vMotion) the VM from one LUN or volume to another. This 40-second silent video shows how easy it is to change a storage policy (view full screen for best quality)...

As shown in the video, a policy can be applied to an individual virtual disk or to all objects that make up a VM, enabling very precise control of the storage services provided to each object.

Protection Against Disk and Host Failure

The most common question asked when discussing VSAN availability is how VSAN protects against the failure of a disk or an entire host. The most common method is mirroring objects across multiple hosts. For example, if we configure a storage policy to contain a rule that maintains availability in the event of a disk or host failure, VSAN will create two replicas of an object and place them on separate hosts. If one of the hosts, or a disk in one of the hosts, fails, the object is still accessible using the replica on the other host. In other words, we can tolerate the failure of one host and still maintain access to the data. The simple formula: data availability after N fault domain failures requires N+1 replicas.

The storage policy rule that determines the level of availability of an object on VSAN is "Number of Failures to Tolerate" or "FTT". If FTT=1, we are instructing VSAN to create and place replicas of the object so that the failure of a disk or host can be tolerated. There is also a rule named "Failure Tolerance Method", which contains two options: RAID-1 mirroring and RAID-5/6 erasure coding. RAID-1 mirroring with FTT=1 are the default settings for VSAN. We will discuss RAID-5/6 erasure coding shortly.
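The N+1 replica rule, and the familiar 2N+1 host minimum that comes with it (replicas plus witness tie-breakers spread across fault domains), can be expressed directly. This is a sketch; the helper names are ours:

```python
def replicas_needed(ftt: int) -> int:
    """RAID-1 mirroring: tolerating N failures requires N+1 full replicas."""
    return ftt + 1

def min_hosts_raid1(ftt: int) -> int:
    """Minimum hosts for RAID-1 mirroring: 2N+1, since witness
    components are also placed on distinct hosts to break ties."""
    return 2 * ftt + 1

# FTT=1: two replicas, three hosts minimum (two replicas + one witness).
print(replicas_needed(1), min_hosts_raid1(1))  # 2 3
```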

Witness Components

As with many clustering solutions, having an even number of replicas creates the risk of a "split brain" scenario when the hosts containing the replicas are unable to communicate with each other over the network. To solve this issue, VSAN will typically create another component called a "witness". This witness component is small - approximately 4MB - and serves as a tie-breaker when a partition occurs.

For an object to be accessible, VSAN must achieve quorum for the object by having access to more than 50% of the active components that make up an object. In the example above, a minimum of both data components or a data component and the witness component must be accessible to have a quorum. If any two of the three components above (or all three) are offline, the object would not be accessible until greater than 50% are back online.

It is important to note that witness components might not be utilized in all cases. Recent versions of VSAN utilize the concept of "votes". More than 50% of the votes must be available to achieve quorum. Each component usually has a vote count equal to 1. In some cases, if the number of replicas (each with 1 vote) is even, VSAN will simply add a vote to one of the components making the number of votes odd. Quorum is achieved as long as more than 50% of the votes are accessible.
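The quorum rule reduces to a single comparison. A minimal sketch of the vote check (the function name `has_quorum` is ours), applied to the FTT=1 mirroring layout of two replicas plus one witness:

```python
def has_quorum(votes_accessible: int, votes_total: int) -> bool:
    """An object is accessible only if strictly more than 50% of
    its votes are reachable."""
    return votes_accessible * 2 > votes_total

# FTT=1 mirror: two replicas (1 vote each) + one witness (1 vote) = 3 votes.
print(has_quorum(2, 3))  # True  - one component lost, 2 of 3 votes remain
print(has_quorum(1, 3))  # False - two components lost, object inaccessible
```

Note the strict inequality: with an even vote total, exactly half is not enough, which is precisely why VSAN adds a witness or an extra vote to make the total odd.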

RAID-5/6 Erasure Coding

As mentioned earlier, the other failure tolerance method is RAID-5/6 erasure coding. Selecting this option for the Failure Tolerance Method rule enables FTT=1 and FTT=2 resilience while reducing the amount of raw capacity consumed when compared to RAID-1 mirroring. For example, a 100GB object protected by RAID-1 mirroring and FTT=1 consumes 200GB, i.e., 2 x 100GB replicas. That same 100GB object protected by RAID-5 erasure coding (FTT=1) consumes 133GB of raw storage - a 33% reduction versus RAID-1 mirroring. The following diagram shows component placement in a 4-node all-flash VSAN cluster for an object protected by RAID-5 erasure coding.

Note that there are three data components plus a parity component. At least three of the four components must be present for this object to be accessible. In other words, we can tolerate the loss of any one host and maintain accessibility, i.e., FTT=1. If FTT is set to 2 and RAID-5/6 erasure coding is selected as the failure tolerance method, we would see six components distributed across six hosts - four data components and two parity components. RAID-6 erasure coding (FTT=2) reduces raw capacity consumption by 50% versus FTT=2 with RAID-1 mirroring.
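The capacity figures above follow from the stripe layouts: RAID-1 stores N+1 full copies, RAID-5 stores 3 data + 1 parity (4/3 overhead), and RAID-6 stores 4 data + 2 parity (6/4 overhead). A sketch of the arithmetic, with a hypothetical helper of our own naming:

```python
def raw_capacity_gb(usable_gb: float, ftt: int, erasure_coding: bool) -> float:
    """Approximate raw capacity an object consumes under VSAN's two
    failure tolerance methods (illustrative sketch, not an official API)."""
    if not erasure_coding:
        return usable_gb * (ftt + 1)   # RAID-1: N+1 full replicas
    if ftt == 1:
        return usable_gb * 4 / 3       # RAID-5: 3 data + 1 parity
    if ftt == 2:
        return usable_gb * 6 / 4       # RAID-6: 4 data + 2 parity
    raise ValueError("erasure coding supports FTT=1 or FTT=2 only")

print(raw_capacity_gb(100, 1, False))        # 200.0 (RAID-1, FTT=1)
print(round(raw_capacity_gb(100, 1, True)))  # 133   (RAID-5)
print(raw_capacity_gb(100, 2, False))        # 300.0 (RAID-1, FTT=2)
print(raw_capacity_gb(100, 2, True))         # 150.0 (RAID-6, half of 300)
```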

While RAID-5/6 erasure coding does help conserve capacity, there are tradeoffs. RAID-1 mirroring requires fewer hosts to satisfy a given FTT: as we saw above, a minimum of three hosts suffices for FTT=1 RAID-1 mirroring, while RAID-5 erasure coding requires at least four. Erasure coding also introduces write amplification, as parity components must be updated whenever data components are written. This failure tolerance method therefore has the potential to impact performance, as is the case with any type of storage (not just VSAN). Since VSAN requires all-flash devices to support RAID-5/6 erasure coding, little if any performance impact is observed in practice. For a deeper conversation on how VSAN utilizes RAID-5/6 erasure coding, take a look at this blog article.

Failure Tolerance Method Recommendations

Considering the two failure tolerance methods, you might be wondering if there are recommendations around which method to use. Generally speaking, RAID-1 mirroring should be used if performance is most important and you have plenty of raw capacity. RAID-1 mirroring is your only option if you are using a hybrid VSAN configuration (magnetic disks in the capacity tier). If you have an all-flash VSAN configuration and space efficiency is more important, RAID-5/6 erasure coding reduces capacity consumption with very little or no impact to performance. Keep in mind it is possible to assign a different policy to each object. Consider a VM running a database - it might make sense to have a RAID-1 policy assigned to some VMDKs and a RAID-5/6 policy assigned to the other VMDKs. This is an excellent example of one of the many benefits of SPBM and VSAN.

That does it for Part 2 of this series. Hopefully, you have a better understanding of how VSAN distributes components across hosts in a cluster based on the applied storage policy to maintain the availability of an object when one or more disks or hosts are offline. We mentioned network partitions earlier - that is the next topic in Part 3 of this series.
