Surviving 63 Drive Failures on vSAN

One of the most common discussion topics around vSAN is availability. There is also a lot of misconception and FUD (fear, uncertainty and doubt) around this topic. What better way to debunk it than with a little failure test!

I didn't want to do just a simple, single-drive failure; I wanted a large-scale failure, as large as my lab environment could hold while still resembling a typical enterprise environment.

So here is how it ended up looking.

7 x vSphere ESXi 6.5 nodes (nested)
- each node has 3 Disk Groups (each with 1 x Cache Drive, 7 x Capacity Drives)
- a total of 24 x Drives per node, which is very similar to most vSAN ReadyNode configs
- I did consider maxing out the drives to 35 per node, but ran out of lab resources
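To keep the drive math straight, here is a minimal Python sketch using only the figures from the list above. The constants are just the lab values from this post, not anything queried from vSAN itself.

```python
# Drive math for this lab build: 7 nodes, each with 3 disk groups of
# 1 cache drive + 7 capacity drives. Purely illustrative arithmetic.

NODES = 7
DISK_GROUPS_PER_NODE = 3
CACHE_PER_DISK_GROUP = 1
CAPACITY_PER_DISK_GROUP = 7

drives_per_node = DISK_GROUPS_PER_NODE * (CACHE_PER_DISK_GROUP + CAPACITY_PER_DISK_GROUP)
capacity_drives_per_node = DISK_GROUPS_PER_NODE * CAPACITY_PER_DISK_GROUP

print(f"Drives per node:           {drives_per_node}")                   # 24
print(f"Capacity drives per node:  {capacity_drives_per_node}")          # 21
print(f"Capacity drives (cluster): {NODES * capacity_drives_per_node}")  # 147
```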

The policy created for the VM was Failures to Tolerate (FTT) = 3, RAID-1, Stripe Width = 12. I increased the Stripe Width to the maximum to have the object hit as many drives as possible.
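For context on what that policy means for placement: with RAID-1 mirroring, vSAN keeps FTT + 1 full replicas of the object, needs at least 2 x FTT + 1 hosts (fault domains), and the stripe width spreads each replica across that many capacity drives. A rough sketch of that arithmetic is below; treat it as an approximation, since witness components and large-component splits add to the real component count.

```python
FTT = 3            # Failures to Tolerate from the policy above
STRIPE_WIDTH = 12  # Number of disk stripes per object, the vSAN maximum

replicas = FTT + 1                              # RAID-1 mirror copies of the object
min_hosts = 2 * FTT + 1                         # minimum fault domains for RAID-1 at this FTT
min_data_components = replicas * STRIPE_WIDTH   # data components before witnesses/splits

print(f"Replicas:                {replicas}")              # 4
print(f"Minimum hosts required:  {min_hosts}")             # 7 -- exactly this lab's node count
print(f"Data components (min):   {min_data_components}")   # 48
```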

We will then fail 3 x vSAN Nodes and verify that the VM is still running and able to copy files. 3 host failures will mean a total of 63 Capacity Drive failures!
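The 63-drive figure follows directly from the configuration above; a quick check, reusing the same per-node numbers (again, plain arithmetic rather than a query against the cluster):

```python
FAILED_HOSTS = 3
DISK_GROUPS_PER_NODE = 3
CAPACITY_PER_DISK_GROUP = 7
FTT = 3  # from the RAID-1 policy above

capacity_drives_lost = FAILED_HOSTS * DISK_GROUPS_PER_NODE * CAPACITY_PER_DISK_GROUP
print(f"Capacity drives failed:  {capacity_drives_lost}")     # 63
print(f"Within policy tolerance: {FAILED_HOSTS <= FTT}")      # True -- the VM should stay online
```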

It would be interesting to see a traditional SAN survive that many drive failures.

Charles Chow

I am an IT practitioner (my day job) who has worked across multiple roles ranging from end-user to post-sales, pre-sales, sales, and management.

I enjoy everything technology and am a big advocate of embracing new tech. I love taking things apart and understanding how they work, appreciating in the process the engineering that goes into them.

Sometimes I take my passion at work and apply it to my hobbies as well, aka cycling.
