Storage that provides 100% performance at 99% full

A couple of weeks back we were talking with Qumulo at Storage Field Day 20 (SFD20) and they made mention that they were able to provide 100% performance at 99% full. Please see their video session during SFD20 (which can be seen here). I was a bit incredulous of this seeing as how every other modern storage system performance degrades long before they get to 99% capacity.

So I asked them to explain how this was possible. But before we get to that a little background on modern storage systems would be warranted.

The perils of log structured file systems

Most modern storage systems use a log structured file system where when they write data they write it to a sequential log and use a virtual addressing scheme to show where the data is located for that address, creating a (data) log of written blocks.

However, when data is overwritten, it leaves gaps in these data logs. These gaps need to be somehow recycled (squeezed out) in order to be able to be consumed as storage capacity. This recycling process is commonly called “garbage collection”.

Garbage collection does its work by reading heavily gapped log files and re-writing the old, but still current, data into a new log. This frees up those gaps to be reused. But garbage collection like this takes reading and writing of logs to free up space.

Now as log structured file systems get (70-80-90%) full, they need to spend more and more system time and effort (=performance) garbage collecting . This takes system (IO) performance away from normal host IO activity. Which is why I didn’t believe that Qumulo could offer 100% IO performance at 99% full.

But there was always another way to supply storage virtualization (read snapshotting) besides log files. Yes it might involve more metadata (table) management, but what it takes in more metadata, it gives back by requiring no garbage collection.

How Qumulo does without garbage collection

Qumulo has a scaled block store for a back end of their file and object cluster store. And yes it’s still a virtualized block store BUT it’s not a log structured file store.

It seems that there’s a virtual-to-physical mapping table that is used by Qumulo to determine the physical address of any virtual block in the file system. And files are allocated to virtual blocks directly through the use of B-tree metadata. These B-trees indicate which virtual blocks are in use by a file and its snapshots

If a host overwrites a data block. The block can be freed (if not being used in a snapshot) and placed on a freed block list and a new block is allocated in its place. The file’s allocated blocks b-tree is updated to reflect the new block and that’s it.

For snapshots, Qumulo uses something they call “write-out-of-place” process when data that a snapshot points to is overwritten. Again, it appears as if snapshots are some extra metadata associated with a file’s B-tree that defines the data in the snapshot.

The problem comes in when a file is deleted. If it’s a big enough file (TB-PB?), there could be millions to billions of blocks that have to be freed up. This would take entirely too long for a delete command, so this is done in the background. Qumulo calls this “reclaim delete“. So a delete of a big file unlinks the block B-tree from the directory and puts it on this reclaim delete work queue to free up these blocks later. Similarly, when a big snapshot is deleted, Qumulo performs a background process called “reclaim snapshot” for snapshot unique blocks.

As can be seen (it’s very hard to see given the coloration of the chart) from this screen shot of Qumulo’s session at SFD20, reclaim delete and reclaim snapshot are being done concurrently (in the background) with normal system IO. What’s interesting to note here is that reclaim IO (delete and snapshots) are going on all the time during the customers actual work. Why the write throughput drops significantly doing the the 27-29 of July is hard to understand. But the one case where it’s most serious (middle of July 28) reclaim IO also drops significantly. If reclaim IO were impacting write performance I would have expected it to have gone higher when write throughput went lower. But that’s not the case. From what I can see in the above reclaim IO has no impact on read or write throughput at this customer.

So essentially, by using a backing block store that does no garbage collection (not using a log structured file system), Qumulo is able to offer 100% system IO performance at 99% full – woah.

All flash storage performance testing

There are some serious problems with measuring IO performance of all flash arrays with what we use on disk storage systems. Mostly, these are due to the inherent differences between current flash- and disk-based storage.

NAND garbage collection

First off, garbage collection is required by any SSD or NAND storage to be able to write data. Garbage collection coalesces free space by moving non-modified data to new pages/blocks and freeing up the space held by old, no-longer current data.

The problem is NAND garbage collection takes place only after a suitable amount of write activity and measuring all-flash array storage system performance without taking into account garbage collection is misleading at best and dishonest at worse.

The only way to control for garbage collection is to write lots of data to a all-flash storage system and measure its performance over a protracted period of time. How long this takes is dependent on the amount of storage in an all flash array but filling it up to 75% of its capacity and then measuring IO performance as you fill up another 10-15% of its capacity with new data should suffice. Of course this would all have to be done consecutively, without any time off between runs (which would allow garbage collection to sneak in).

Flash data reduction

Second, many all flash arrays offer data reduction like data compression or deduplication. Standard IO benchmarks today don’t control for data reduction.

What we need is a standard corpus of reducible data for an IO workload. Such data would need to be able to be data compressed and data deduplicated. Unclear where such a data corpus could be found but one is needed to properly measure all flash system performance. What would help is some real world data reduction statistics, from a large number of customer installations that could help identify what real-world dedup and compression ratios look like. Then we could use these statistics to construct a suitable data load that can then be scaled and tailored to required performance needs.

Perhaps SNIA or maybe a big (government) customer could support the creation of this data corpus that can be used for “standard” performance testing. With real world statistics and a suitable data corpus, standard IO benchmarks could control for data reduction on flash arrays and better measure system performance.

Block IO differences

Third, block heat maps (access patterns) need to become much more realistic. For disk based systems it was important to randomize IO stream to minimize the advantage of DRAM caching. But with all flash storage arrays, cache is less useful and because flash can’t be rewritten in place, having IO occur to the same block (especially overwrites) causes NAND page fragmentation and more NAND write overhead.

~~~~

Only by controlling for garbage collection, using a standard, data reducible data load and returning to a cache friendly (or at least write cache friendly) workload we will truly understand all flash storage performance.

Comments?

Thanks to Larry Freeman (@Larry_Freeman) for the idea for today’s post.

Photo Credit(s): Race Faces by Jerome Rauckman