Failure Scenarios for vSAN
I had a question recently about failure scenarios in vSAN. FTT=1 is great when the cluster is small, but what if the cluster size increases? Should the FTT (Failures To Tolerate) value increase accordingly?
Let's take a closer look at how this works. The discussion today is mainly about data placement, not compute. For compute, it is assumed that if the failure hits the node where the VM is running, the VM is restarted on another host (unless, of course, it is in FT mode).
3-Node Failure Scenario
Assume we have a typical 3-node setup, with each node having 2 capacity disks. If we enable FTT=1 / FTM=RAID 1, the layout will look a little like the diagram below. As you can see, there will be 2 mirrored copies of the data and a witness component. The witness component is used to prevent a split-brain scenario.
(Fig. 1) Typical 3-node Cluster
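As a rough way to reason about the layout above, here is a minimal sketch of the component arithmetic for a mirrored object. It assumes the simplest case (no striping, one witness per extra fault domain needed), and the function name `raid1_layout` is my own, not a vSAN API.

```python
def raid1_layout(ftt: int) -> dict:
    """Simplified component counts for a RAID 1 (mirrored) object.

    Assumption: no striping, so each replica is a single component.
    """
    if ftt < 1:
        raise ValueError("ftt must be at least 1 for a mirrored object")
    replicas = ftt + 1                 # full copies of the data
    min_hosts = 2 * ftt + 1            # hosts needed so a majority survives ftt failures
    witnesses = min_hosts - replicas   # tie-breaker components in this simplified model
    return {"replicas": replicas, "witnesses": witnesses, "min_hosts": min_hosts}


print(raid1_layout(1))  # {'replicas': 2, 'witnesses': 1, 'min_hosts': 3}
```

For FTT=1 this gives the 2 data copies plus 1 witness spread across 3 nodes, which is why a 3-node cluster is the smallest that can satisfy the policy.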
(Fig. 2) 3-node Cluster Single Node Failure
When a node fails completely, it will look something like the diagram above. Since we still have a good copy of the data available, there will be no impact to production I/O. That said, vSAN will not have anywhere to rebuild the missing copy. This is because FTT=1 with RAID 1 needs a minimum of 3 different nodes to place its components (2 data copies and the witness), and only 2 nodes remain.
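The rebuild constraint can be sketched as a simple set check: a lost component can only be re-created on a healthy host that does not already hold another component of the same object. This is an illustrative model only. The host names and the `can_rebuild` helper are hypothetical, and it ignores capacity, disk groups, and repair timers.

```python
def can_rebuild(all_hosts, failed_hosts, component_hosts):
    """Return True if every lost component has a free healthy host to land on.

    Assumption: each component of an object must sit on a different host.
    """
    healthy = set(all_hosts) - set(failed_hosts)
    surviving = set(component_hosts) & healthy   # hosts still holding a component
    lost = set(component_hosts) - healthy        # components that went down with a host
    spare = healthy - surviving                  # healthy hosts with no component yet
    return len(spare) >= len(lost)


# 3-node cluster, one node down: nowhere to place the replacement component
print(can_rebuild(["esx1", "esx2", "esx3"], ["esx1"], ["esx1", "esx2", "esx3"]))

# 4-node cluster, one node down: the fourth host can take the rebuilt component
print(can_rebuild(["esx1", "esx2", "esx3", "esx4"], ["esx1"], ["esx1", "esx2", "esx3"]))
```

The first call returns False and the second returns True, which is the difference between a 3-node and a 4-node cluster after a single node failure.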
Node failures are rare, so let's look at a more common scenario where a drive holding the data fails. In this instance, the data will be automatically rebuilt, assuming there is sufficient spare capacity. The diagram below shows how the data is recreated on the surviving drive of the same node.
(Fig. 3) 3-node Cluster Single Drive Failure
4-Node Failure Scenario
Now, let's look at a 4-node setup. The following diagram shows a single node failure in a 4-node cluster. It looks fairly similar to the 3-node setup, except that with a 4-node cluster the data can be rebuilt on the remaining node without waiting for the failed node to come back.
(Fig. 4) Typical 4-node cluster
(Fig. 5) 4-node Cluster Single Node Failure
Let's change it up slightly. Assume that in a 4-node cluster, 2 nodes fail (1 node contains data, and the other contains no data or witness components for this VM). Technically, a 2-node failure in an FTT=1 setup exceeds the policy, potentially taking the VM offline.
The diagram below shows a 2-node failure that does not take the VM offline. The VM has lost only 1 of its 3 components, so it is still within what FTT=1 tolerates, even though the cluster as a whole is not.
(Fig. 6) 4-node Cluster Double Node Failure
So as you can imagine, in a large cluster where VMs and data are spread out fairly evenly, if the number of failures exceeds the defined FTT policy, some VMs may be affected while others continue as usual. The screenshot below shows a 4-node cluster with 2 node failures.
The Dummy1 VM is still in "Reduced availability with no rebuild" because at least 2 of its 3 components survive. The Dummy2 VM, on the other hand, has lost 2 of 3, hence it is showing "Inaccessible".
(Fig. 7) vCenter Virtual Objects Screen after 2 node failures
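The difference between the two VMs comes down to where their components happened to land. Here is a small sketch of that per-object check, assuming one vote per component and a simple majority rule. The host names and the placement of Dummy1 and Dummy2 are made up for illustration (the screenshot does not show them), and the returned labels are shorthand rather than the exact vCenter strings.

```python
def object_state(component_hosts, failed_hosts):
    """Classify an object by how many of its components are still reachable.

    Assumption: each component carries one vote and the object stays
    accessible only while more than half of the votes are available.
    """
    failed = set(failed_hosts)
    alive = [h for h in component_hosts if h not in failed]
    if len(alive) == len(component_hosts):
        return "Healthy"
    if 2 * len(alive) > len(component_hosts):
        return "Reduced availability"
    return "Inaccessible"


failed = ["esx1", "esx2"]  # two of the four hosts are down

# Hypothetical placements: Dummy1 only touches one failed host, Dummy2 touches both
print("Dummy1:", object_state(["esx1", "esx3", "esx4"], failed))
print("Dummy2:", object_state(["esx1", "esx2", "esx3"], failed))
```

With these placements Dummy1 keeps 2 of 3 components and stays available, while Dummy2 keeps only 1 of 3 and becomes inaccessible, even though both VMs use the same FTT=1 policy on the same cluster.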
So back to the first question: as the cluster grows, is it wise to increase the FTT policy?
It really depends on the requirements of the individual VMs. Unfortunately, this varies between applications and organisations, but I hope that understanding how vSAN deals with failures will help you decide which approach is better.
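To put a rough number on how exposure changes with cluster size, here is a back-of-the-envelope sketch. It assumes each FTT=1 object has 3 components placed on 3 hosts chosen uniformly at random, and that exactly 2 hosts fail at the same time, also at random. Real vSAN placement is not uniformly random, so treat the output as an intuition aid, not a prediction. The function name is my own.

```python
from math import comb


def p_inaccessible(n_hosts: int, n_failed: int = 2, n_components: int = 3) -> float:
    """Chance that one object loses its majority when n_failed hosts go down.

    Assumption: components sit on distinct hosts picked uniformly at random,
    and the failed hosts are also picked uniformly at random.
    """
    total = comb(n_hosts, n_failed)  # all possible sets of failed hosts
    bad = 0
    for k in range(min(n_failed, n_components) + 1):
        alive = n_components - k     # components left if k of them were hit
        if 2 * alive <= n_components:
            # k failed hosts hold components, the rest of the failures miss the object
            bad += comb(n_components, k) * comb(n_hosts - n_components, n_failed - k)
    return bad / total


for n in (4, 8, 16, 32):
    print(n, "hosts:", round(p_inaccessible(n), 4))
```

Under these assumptions, half of the objects in a 4-node cluster would be hit by a double node failure, and that share shrinks as hosts are added. On the other hand, a bigger cluster has more hosts that can fail in the first place, so this is only one side of the trade-off.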
Update [Feb 19, 2020] - Included a link to a video I did a while back on disk failures on vSAN. I also wrote briefly about it [link].