High Capacity VSAN Nodes Part 2

This entry is part 2 of 3 in the series High Capacity VSAN Nodes


This is part 2 of a series of lessons learned and examples worked based on a 400+ TB VSAN solution I helped a partner engineer. I’m comparing high capacity VSAN nodes with standard configurations. This post will focus on calculating the IOPS required to de-stage writes from the caching layer, then examining the implications on hardware choices.

Calculating IOPS Required

The caching layer acts to soak up spikes of data and IOPS which come up during the course of operations (for example, a periodic dump of meteorological images). When using low-density (IOPS/TB) storage, the rate at which the data can be de-staged from the caching layer to the persistence layer becomes very important. In the example burst of data, it’s important to know the average rate at which data is being written to the storage. Though the caching layer can soak up the spikes, ultimately, the persistence layer needs to be able to handle the data being de-staged to it. Keep in mind that policies might dictate more than a single copy of data be written. Let’s consider 1 TB ingested every day. This is a round number that I picked which has profound implications on the worked example. Every situation is different, so you should really examine the specifics of yours. Some rough back-of-the-napkin math:

    \[ 10^6 \frac{MB}{day} * \frac{1}{24}\frac{day}{hours} * \frac{1}{60}\frac{hour}{minutes} * \frac{1}{60}\frac{minute}{seconds} * 2 \text{ copies} \approx 23.2\frac{MB}{second} \]

That’s 11.6 MB/s for a single copy. Since the policy dictates two copies, that’s 23.2 MB/s.
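The conversion above can be sketched in a few lines of Python. This is only an illustration of the arithmetic: the function name `destage_mb_per_s` and its parameters are made up for this example, and it assumes decimal units (1 TB = 10^6 MB) and a perfectly even ingest over the day.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def destage_mb_per_s(ingest_tb_per_day: float, copies: int) -> float:
    """Average write rate the persistence layer must absorb, in MB/s.

    Assumes decimal units (1 TB = 1,000,000 MB) and an even ingest rate.
    """
    mb_per_day = ingest_tb_per_day * 1_000_000
    return mb_per_day / SECONDS_PER_DAY * copies


single = destage_mb_per_s(1, copies=1)
mirrored = destage_mb_per_s(1, copies=2)
print(f"single copy: {single:.2f} MB/s, two copies: {mirrored:.2f} MB/s")
```

Without intermediate rounding, the two-copy figure comes out at roughly 23.15 MB/s; the 23.2 MB/s used in the rest of this post comes from doubling the rounded 11.6 MB/s.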


Implications of Block Size

To calculate the IOPS needed to handle that de-staging throughput, consider the block size being written. This is actually an interesting advantage of VSAN: operating systems typically use 4KB blocks and VMFS traditionally uses 1MB blocks, but the VSAN 6.0 file system (VirstoFS) can write in 4MB chunks while using 512-byte allocation units. Why is this significant? VirstoFS can capture block changes at a granularity of 512 bytes, so it can create very small, performant snapshots. But imagine if it could only write 512-byte blocks:

    \[ 23.2\frac{MB}{second} * 1000 \frac{KB}{MB} * 1000 \frac{byte}{KB} * \frac{1}{512}\frac{IO}{byte} = 45,312 \frac{IO}{second} \]

You can immediately see that the block size for this solution will have another profound impact on the IOPS. This is something that needs to be examined a bit further and probably tested on the final solution. If 4KB blocks are being used, for example, we’d need around 5,800 IOPS. 1MB blocks would require 23.2 IOPS. If 4MB blocks are used…

    \[ 23.2 \frac{MB}{second} * \frac{1}{4}\frac{IO}{MB} = 5.8 IOPS \]


That is a big difference.
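To make the block-size sensitivity easier to play with, here is a small Python sketch of the same division. The helper `iops_required` and the `BLOCK_SIZES` table are names invented for this example, and the sizes use decimal units (4 KB = 4,000 bytes) to match the round numbers above; with binary units the results shift slightly.

```python
def iops_required(throughput_mb_per_s: float, block_bytes: int) -> float:
    """IOPS needed to sustain a throughput at a given I/O size (decimal units)."""
    return throughput_mb_per_s * 1_000_000 / block_bytes


THROUGHPUT_MB_PER_S = 23.2  # two-copy de-stage rate from the worked example

# Hypothetical table of I/O sizes to compare, in bytes.
BLOCK_SIZES = {
    "512 B": 512,
    "4 KB": 4_000,
    "1 MB": 1_000_000,
    "4 MB": 4_000_000,
}

for label, size in BLOCK_SIZES.items():
    print(f"{label:>5}: {iops_required(THROUGHPUT_MB_PER_S, size):>10,.1f} IOPS")
```

Whatever I/O size the final solution actually writes at should be measured rather than assumed; the table is just a quick way to see how far the answer swings.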

Great, we now know that for our hypothetical 1TB/day ingest rate, we only need 5.8 IOPS?! Even with our low IOPS/TB density storage, that’s still handled by a single drive at 100 IOPS. Imagine if we were using 4KB blocks:

    \[ \frac{5,800 IOPS}{100 \frac{IOPS}{drive}} = 58 drives \]

We’d need 58 drives just for IOPS! With Large Form Factor (LFF) drives, we can only fit 12 per 2U host, which would mean:

    \[ \left\lceil \frac{58 drives}{12 \frac{drives}{host}} \right\rceil = 5 hosts \]

Note: \lceil \rceil is the “ceiling” function, which rounds up to the next integer.
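The ceiling step maps directly onto `math.ceil` in Python. The sketch below is only illustrative: `drives_for_iops` and `hosts_for_drives` are hypothetical helper names, and the 100 IOPS per drive and 12 drives per host are the assumptions used in this post, not measured values.

```python
import math


def drives_for_iops(required_iops: float, iops_per_drive: float) -> int:
    """Smallest whole number of drives that meets an IOPS target."""
    return math.ceil(required_iops / iops_per_drive)


def hosts_for_drives(drives: int, drives_per_host: int) -> int:
    """Smallest whole number of hosts that can house the given drives."""
    return math.ceil(drives / drives_per_host)


# 4KB-block case: 5,800 IOPS on 100-IOPS drives, 12 LFF bays per 2U host
drives = drives_for_iops(5_800, iops_per_drive=100)
hosts = hosts_for_drives(drives, drives_per_host=12)
print(f"{drives} drives in {hosts} hosts")

# 4MB-block case: 5.8 IOPS still rounds up to one whole drive
print(drives_for_iops(5.8, iops_per_drive=100), "drive")
```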

Capacity Required

As it is, the real issue we face is the retention period for our data. If it needs to be held for a year, then we need:

    \[ \left \lceil 1\frac{TB}{day} * 365 \frac{day}{year} * 2 \text{ copies} * \frac{1}{0.7} \text{ slack space} * \frac{1}{0.99} \text{ formatting overhead} \right \rceil = 1054 TB \]

    \[ \left \lceil \frac{1054 TB}{6\frac{TB}{drive}} \right \rceil = 176 drives \]

    \[ \left \lceil \frac{176 drives}{12\frac{drives}{host}} \right \rceil = 15 hosts \]
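The capacity chain above can be reproduced with a short Python sketch. The function `raw_capacity_tb` and its parameter names are invented for this example; the 70% usable fraction (slack space), 99% formatting efficiency, 6 TB drives, and 12 drives per host are the assumptions from this post.

```python
import math


def raw_capacity_tb(
    ingest_tb_per_day: float,
    retention_days: int,
    copies: int,
    usable_fraction: float = 0.7,     # leaves 30% slack space
    format_efficiency: float = 0.99,  # 1% formatting overhead
) -> int:
    """Raw TB to purchase so the retained data fits, rounded up."""
    stored = ingest_tb_per_day * retention_days * copies
    return math.ceil(stored / usable_fraction / format_efficiency)


capacity = raw_capacity_tb(1, retention_days=365, copies=2)
drives = math.ceil(capacity / 6)   # 6 TB LFF drives
hosts = math.ceil(drives / 12)     # 12 LFF drives per 2U host
print(f"{capacity} TB raw -> {drives} drives -> {hosts} hosts")
```

Changing the retention period or the number of copies is usually the quickest way to see how sensitive the host count is to policy decisions.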

So we only need 1 drive for the IOPS requirement, but 176 drives housed in 15 hosts for the capacity requirement. Capacity wins, and we’ll have excess IOPS. Keep in mind that in this case we need more disks for capacity (176) than we would need even for 5,800 IOPS (58). Oh, how many IOPS do we end up with?

    \[ 176 drives * 100 \frac{IOPS}{drive} = 17,600 IOPS \]

Now let’s compare that to using Small Form Factor (SFF) 10K RPM drives at 140 IOPS/drive.
For capacity requirements:

    \[ \left \lceil \frac{1054 TB}{1.2\frac{TB}{drive}} \right \rceil = 879 drives \]

    \[ \left \lceil \frac{879drives}{21\frac{drives}{host}} \right \rceil = 42 hosts \]

Again, for each project we have to meet both capacity and IOPS requirements, so we choose the larger of the two numbers and go with 879 drives housed in 42 hosts. This solution will also give us an excess of IOPS:

    \[ 879 drives * 140 \frac{IOPS}{drive} = 123,060 IOPS \]
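Since the same "larger of capacity or IOPS" rule is applied to both drive types, it can be wrapped in one function and run against each profile. This is a rough sketch, not a sizing tool: `DriveProfile` and `size_cluster` are hypothetical names, and the per-drive capacity, IOPS, and drives-per-host figures are simply the assumptions used in this post.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class DriveProfile:
    name: str
    capacity_tb: float
    iops: int
    drives_per_host: int


def size_cluster(profile: DriveProfile, required_tb: float, required_iops: float) -> dict:
    """Size for whichever requirement (capacity or IOPS) needs more drives."""
    by_capacity = math.ceil(required_tb / profile.capacity_tb)
    by_iops = math.ceil(required_iops / profile.iops)
    drives = max(by_capacity, by_iops)
    hosts = math.ceil(drives / profile.drives_per_host)
    return {"drives": drives, "hosts": hosts, "aggregate_iops": drives * profile.iops}


LFF = DriveProfile("LFF 6 TB", capacity_tb=6.0, iops=100, drives_per_host=12)
SFF = DriveProfile("SFF 1.2 TB 10K", capacity_tb=1.2, iops=140, drives_per_host=21)

for profile in (LFF, SFF):
    result = size_cluster(profile, required_tb=1054, required_iops=5.8)
    print(f"{profile.name}: {result}")
```

Passing a larger `required_iops` (for example, the 4KB-block figure) shows at what point IOPS, rather than capacity, would start to drive the drive count.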

Just to re-emphasize, this is a measure of the rate at which the persistence layer can absorb writes de-staged from the cache layer and service read cache misses.

At this point, we can see why a higher density solution would be more cost effective: using LFF drives takes 27 fewer hosts (15 versus 42), and either 27 or 54 fewer licenses, depending on whether one or two CPU sockets are populated per host (which in turn depends on compute requirements).

Up next: Flexibility of Purchased Capacity



Published by


John White is walking the path to virtualization mastery.

2 thoughts on “High Capacity VSAN Nodes Part 2”

  1. Question, why did you use ’12’ for the number of LFF drives in a 2U? I’m looking at a DL380 Gen 9 for example (which I know wasn’t around when this post came out), but it only supports 12 total drives. You would need to split this into two disk groups, each with a cache drive, so 1+5, 1+5, resulting in 10 capacity drives.

    1. Thanks for the question, ikiris: Short answer is rear-facing expansion slots (definitely available on the DL380 gen9 with either 2 SFF or 3 LFF). Even more likely is that the capacities of the spinning disks are so large that you need PCIe SSD to get the capacities necessary. Make sense?
