Surviving 63 Drive Failures on vSAN
One of the most common discussion topics around vSAN is availability, and there is a lot of misconception and FUD (fear, uncertainty and doubt) around it. What better way to debunk it than by running a little failure test!
I didn't want to do a simple, single-drive failure; I wanted a large-scale failure, as large as my lab environment could hold while still resembling a typical enterprise environment.
So here is how the setup ended up looking.
7 x vSphere ESXi 6.5 nodes (nested)
- each node has 3 Disk Groups (each with 1 x Cache Drive, 7 x Capacity Drives)
- a total of 24 x drives per node, which is very similar to most vSAN Ready Node configs (the drive math is sketched below)
- I did consider maxing out at 35 drives per node, but ran out of lab resources
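For reference, here is a quick sketch of the drive math behind that layout; the numbers are simply the ones from the list above.

```python
# Drive inventory implied by the nested lab above (numbers from the list).
NODES = 7
DISK_GROUPS_PER_NODE = 3
CACHE_PER_DISK_GROUP = 1
CAPACITY_PER_DISK_GROUP = 7

drives_per_node = DISK_GROUPS_PER_NODE * (CACHE_PER_DISK_GROUP + CAPACITY_PER_DISK_GROUP)
capacity_drives_per_node = DISK_GROUPS_PER_NODE * CAPACITY_PER_DISK_GROUP

print(f"Drives per node:          {drives_per_node}")                   # 24
print(f"Capacity drives per node: {capacity_drives_per_node}")          # 21
print(f"Capacity drives, cluster: {NODES * capacity_drives_per_node}")  # 147
```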
The policy created for the VM was Failures to Tolerate (FTT) = 3, RAID-1, Stripe Width = 12. I increased the Stripe Width to the maximum so the object would hit as many drives as possible.
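As a rough illustration of what that policy does to a single object (this is not the exact vSAN placement logic, which also adds witness components and may split large components further), a minimal sketch:

```python
# Rough component math for one vSAN object under the policy above.
# Illustration only: real placement also creates witness components
# and may split components by size.
FTT = 3            # Failures to Tolerate
STRIPE_WIDTH = 12  # Number of Disk Stripes per Object

replicas = FTT + 1                          # RAID-1 keeps FTT + 1 full mirrors
data_components = replicas * STRIPE_WIDTH   # each mirror is striped 12 ways

print(f"Mirror copies per object:   {replicas}")         # 4
print(f"Data components per object: {data_components}")  # 48
```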
We will then fail 3 x vSAN nodes and verify that the VM is still able to run and copy files. 3 host failures will mean a total of 63 capacity drive failures!
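Both the 63-drive figure and the expectation that the VM stays up fall out of simple arithmetic, using the numbers from the setup above:

```python
# Failure arithmetic for the test (numbers from the setup above).
FAILED_HOSTS = 3
CAPACITY_DRIVES_PER_HOST = 3 * 7   # 3 disk groups x 7 capacity drives
FTT = 3                            # host failures the RAID-1 policy tolerates

failed_capacity_drives = FAILED_HOSTS * CAPACITY_DRIVES_PER_HOST
print(f"Failed capacity drives: {failed_capacity_drives}")      # 63
print(f"VM expected to stay up: {FAILED_HOSTS <= FTT}")          # True
```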
It will be interesting to see a traditional SAN survive that many drive failures.