The antifragility of disk RAID groups, the fragility of SSDs and what to do about it

Hard Disk by Jeff Kubina (cc) (from Flickr)

[A long post today] I picked up the book Antifragile: Things That Gain from Disorder, by Nassim N. Taleb and, despite trying to put it away at least three times now, can’t stop turning back to it.  In his view, fragility is defined by having a negative (or bad) response to variation, volatility or randomness in general.  Antifragile is the exact opposite of fragile, in that it has a positive (or good) response to more variation, volatility or randomness.  Somewhere between antifragility and fragility is robustness, which has neither a positive nor a negative response (is indifferent) to high volatility, variation or randomness.

Why disks are robust …

To me there are plenty of issues with disks. To name just a few:

  • They are energy hogs,
  • They are slow (at least in comparison to SSDs and flash memory), and
  • They are mechanical contrivances which can be harmed by excess shock/vibration.

But, aside from their capacity benefits, they tend to fail at a normally distributed rate unless there is a particular (special) problem with media, batch, electronics or micro-programming.  I saw plenty of these other types of problems at StorageTek over the years, so I know there are many things that can disturb that normal distribution of disk failures. However, in general, absent some systematic cause of failure, disks fail at a predictable rate with a relatively wide distribution (although, being away from the engineering of storage systems, I have no statistics for the standard deviation of disk failures – it just feels right [Nassim would probably disown me for saying that]).

The other aspect of disk robustness is that as drives degrade over time, they tend to get slower and louder.  The former is predominantly due to defect skipping, an error recovery procedure for bad blocks.  And they get louder as bearings start to wear out, signaling imminent failure ahead.

In defect skipping, when a disk drive detects a bad block, the drive marks the block as bad and uses a spare block somewhere else on the disk for all subsequent writes. The new block is typically “far” away from the old block, so when reading a run of blocks the drive now has to seek out to the new block and seek back to read the rest, increasing response time in the process.
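
To make the mechanics concrete, here is a minimal sketch of the idea in Python. The block numbers, spare-area location and seek costs are all made-up assumptions for illustration, not any vendor’s actual firmware logic.

```python
# Hypothetical sketch of drive-level defect skipping (not any vendor's firmware).
# Bad blocks are remapped to spare blocks "far away" in a reserved area,
# so sequential reads that hit a remapped block pay extra seek time.

SPARE_AREA_START = 1_000_000  # assumed location of the spare block pool

class DefectSkipDrive:
    def __init__(self):
        self.remap = {}          # bad logical block -> spare physical block
        self.next_spare = SPARE_AREA_START

    def mark_bad(self, block):
        """Retire a bad block and assign the next spare block to it."""
        self.remap[block] = self.next_spare
        self.next_spare += 1

    def physical(self, block):
        return self.remap.get(block, block)

    def read_sequential(self, first_block, count, seek_cost=1):
        """Count the extra seeks needed to read a run of logical blocks."""
        extra_seeks = 0
        for blk in range(first_block, first_block + count):
            if blk in self.remap:
                extra_seeks += 2 * seek_cost   # seek out to the spare and back
        return extra_seeks

drive = DefectSkipDrive()
drive.mark_bad(42)
print(drive.physical(42))             # -> 1000000 (the spare block)
print(drive.read_sequential(40, 10))  # -> 2 extra seeks for this run
```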

The other failure phenomenon disks exhibit is the head crash. These seem to occur completely at random in disks from “mature processes”.

So, I believe disks from mature processes have a normally distributed failure rate with a reasonably wide standard deviation around their MTBF. As such, disk drives should be classified as robust.

… and RAID groups of disk drives are antifragile

But, while individual disk drives are merely robust, when you place such devices in a RAID group with others, the resulting group survives failures better than any single drive would.  As long as the failure rate of the devices is randomized and there is a wide variance in that failure rate, when a RAID group encounters a single drive failure it is unlikely that a second, or third (RAID DP/6), drive will also fail while the group is recovering from the first.  (Yes, as disk drives get larger the time to recover gets longer, increasing the probability of multiple drive failures, but absent systematic causes of drive failure, data loss should be rare.)
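
As a rough back-of-the-envelope, here is what that reasoning looks like with assumed MTBF and rebuild figures; the numbers below are illustrative, not measured field data.

```python
# Back-of-the-envelope estimate of a second, independent drive failure
# during a RAID rebuild. All numbers below are illustrative assumptions.

MTBF_HOURS = 1_200_000        # assumed per-drive MTBF
REBUILD_HOURS = 24            # assumed time to rebuild one large drive
SURVIVING_DRIVES = 7          # e.g., an 8-drive RAID group after one failure

# With independent, roughly exponential failures, a drive's chance of dying
# in the rebuild window is about REBUILD_HOURS / MTBF_HOURS.
p_one_drive = REBUILD_HOURS / MTBF_HOURS
p_second_failure = 1 - (1 - p_one_drive) ** SURVIVING_DRIVES

print(f"~{p_second_failure:.4%} chance of a 2nd failure during rebuild")
# If drive failures cluster (a systematic cause), independence breaks down
# and this number badly understates the real risk -- the point of the post.
```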

In a past life we had multiple disk systems at a location subject to volcanic activity. Somehow, sulfuric fumes from the volcano had found their way into the computer room and were degrading the optical transceivers in our disk drives, causing drive failures.   The subsystem at the time had RAID 6 (dual parity), and over the course of a few weeks we had 20 or more disk drives die in these systems. The customer lost no data during this time, but only because the disk drive failure rate was randomly distributed over time with a wide dispersion.

So, by Nassim’s definition, disk RAID groups are antifragile; they operate better with more randomness.

Why SSD and SSD RAID groups are fragile

Toshiba’s New 2.5″ SSD from SSD.Toshiba.com

SSDs have a number of good things going for them. For example:

  • They are blistering fast, at least compared to rotating disks,
  • They are relatively green storage devices, meaning they use less energy than rotating disks, and
  • They are semiconductor devices and, as such, are relatively immune to shock and vibration.

Despite all that, given today’s propensity to use wear leveling, RAID groups composed of SSDs can exhibit fragility because all the SSDs will fail at approximately the same number of program/erase (P/E) cycles.

My assumption is that because NAND wear out is essentially an electro-chemical phenomenon, its failure rate, while normally distributed, probably has a very narrow variance.  Given the technology, NAND pages will fail after some number of writes, be it 10K, 30K or 100K (for MLC, eMLC or SLC respectively), but all the NAND pages of the same technology (manufactured on the same fab line) will likely fail at about the same number of P/E cycles. With wear leveling equalizing the P/E cycles across all pages in an SSD, this means that there is some number of writes that an SSD will endure and then go no farther.  (Again, I have no hard statistics to support this presumption, and Nassim will probabilistically not be pleased with me for saying it.)

As such, for a RAID group made up of wear-leveling SSDs, especially with data striping across the group, all the SSDs will probabilistically fail at almost the same time because they all will have had the same amount of data written to them.  This means that as one SSD in the group approaches wear out, assuming all the others were also fresh when the group was created, then all the other devices will be near wear out too.  As a result, when one SSD fails, the others in the RAID group will have a high probability of failing as well, leading to data loss.
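
To illustrate the point, here is a toy Monte Carlo sketch, with assumed endurance figures, comparing a narrow versus a wide spread in per-device endurance for a striped group where every member sees the same write load.

```python
# Toy Monte Carlo (assumed, illustrative numbers) comparing how often a second
# SSD in a striped RAID group wears out right after the first one,
# for a narrow vs. a wide spread in per-device endurance.

import random

def near_simultaneous_failures(endurance_sigma, group_size=8,
                               mean_endurance=10_000, window=100,
                               trials=10_000):
    hits = 0
    for _ in range(trials):
        # Striping writes the same amount to every device, so each device's
        # life is set purely by its own endurance draw.
        lifetimes = [random.gauss(mean_endurance, endurance_sigma)
                     for _ in range(group_size)]
        lifetimes.sort()
        # Risky case: the 2nd device dies within `window` P/E cycles of the
        # 1st, i.e., while the group is still exposed from the first failure.
        if lifetimes[1] - lifetimes[0] <= window:
            hits += 1
    return hits / trials

print("narrow spread (sigma=50):  ", near_simultaneous_failures(50))
print("wide spread   (sigma=2000):", near_simultaneous_failures(2000))
```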

I have written about this before, see my Potential data loss using SSD RAID groups post for more information.

What can we do about the fragility of SSD RAID groups?

A few items come to mind that could reduce the fragility of a RAID group of SSDs:

  • Intermix older and newer (fresher) SSDs in a single RAID group so they don’t all fail at the same time (see the sketch after this list).
  • Don’t use data striping across RAID groups of SSDs; this would allow some devices to be written more than others and, in doing so, introduce some randomness into the SSD failures in the group.
  • Don’t use RAID 1, as this will always cause the same number of writes to be made to each SSD in a pair.
  • Don’t use RAID 5 or other protection methodologies that spread parity writes across the group; using these would be akin to data striping in that parity writes would be spread evenly across the group.
  • Consider using different technology SSDs in a RAID group; intermixing MLC, eMLC and SLC drives in a RAID group would have the effect of varying the SSD failure rates.
  • Move away from wear leveling to defect skipping; while doing so will cause some SSDs to fail earlier than today, their failure rate will be randomly distributed.
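
As a sketch of the first suggestion (with made-up endurance and wear figures), staggering device ages spreads out the remaining life in the group so wear out no longer arrives for every member at once.

```python
# Illustrative sketch of intermixing older and newer SSDs so their remaining
# endurance is staggered. All figures below are assumptions.

RATED_PE_CYCLES = 10_000   # assumed per-device endurance

def remaining_life(consumed_pe_cycles):
    """Remaining P/E cycles for each device in the group."""
    return [RATED_PE_CYCLES - c for c in consumed_pe_cycles]

# A group built all at once: every device has consumed about the same cycles,
# so they all approach wear out together.
uniform_age_group = [9_500, 9_480, 9_510, 9_490]

# A group deliberately mixed from devices of different ages.
mixed_age_group   = [9_500, 6_000, 3_000,   500]

print("uniform-age remaining life:", remaining_life(uniform_age_group))
print("mixed-age remaining life:  ", remaining_life(mixed_age_group))
# With striping, all devices see the same future write load, so the
# mixed-age group's failures arrive one at a time rather than in unison.
```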

The last suggestion, moving away from wear leveling, probably deserves some discussion.  There are many reasons for wear leveling, one of which is to speed up writes (by always having a fresh page to write); another is that NAND blocks cannot be updated in place, they need to be erased before they can be rewritten.  But another major reason is to distribute write activity across all NAND pages so as to equalize wear out.

In order to speed up writes sans wear leveling, one would need some sort of DRAM buffer to absorb the write activity and later destage it to NAND when available.   The inability to update in place is more problematic, but could potentially be dealt with by using the same DRAM cache to read in the previous information and write back the updates.  Other solutions to this latter problem exist but seem to be more trouble than they are worth.
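
As a purely hypothetical sketch of that write-buffer idea (not a description of any shipping SSD controller), a small DRAM write-back cache could absorb writes and destage them to NAND later.

```python
# Purely hypothetical sketch of a DRAM write-back buffer in front of NAND,
# absorbing writes and destaging whole pages later. Not any real controller.

class WriteBuffer:
    def __init__(self, nand, capacity=1024):
        self.nand = nand            # backing store: page -> bytes
        self.dirty = {}             # DRAM-resident dirty pages
        self.capacity = capacity

    def write(self, page, data):
        # Absorb the write in DRAM; no NAND program/erase yet.
        self.dirty[page] = data
        if len(self.dirty) >= self.capacity:
            self.destage()

    def read(self, page):
        # Serve from DRAM if the page is dirty, otherwise from NAND.
        return self.dirty.get(page, self.nand.get(page))

    def destage(self):
        # Later, write the buffered pages back to (erased) NAND blocks.
        for page, data in self.dirty.items():
            self.nand[page] = data
        self.dirty.clear()

nand = {}
buf = WriteBuffer(nand)
buf.write(7, b"hot data")
print(buf.read(7))     # served from DRAM before any NAND write
buf.destage()
print(nand[7])         # now persisted to NAND
```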

But as for the wear leveling done to equalize NAND page wear out, I believe there’s a less fragile solution.  If we were to institute some form of defect skipping with a certain number of spare NAND pages, we could easily extend the life of an SSD, at least until we run out of spare pages.

Today, a considerable amount of spare capacity ships with most SSDs, over 10% in most enterprise class storage and more with consumer grade. With this much spare capacity, a single NAND logical block could be rewritten an awfully high number of times. For instance, using defect skipping with a 100GB MLC SSD at 10K write endurance, 10% spare pages and a 1MB page size, one single logical block address could be written ~100 million times (assuming no other pages were being written beyond their maximum).
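
Working through that arithmetic with the same assumed figures:

```python
# Worked version of the arithmetic above, using the post's assumed figures.

capacity_gb    = 100          # 100GB MLC SSD
spare_fraction = 0.10         # 10% spare pages reserved for defect skipping
page_size_mb   = 1            # 1MB page size
pe_endurance   = 10_000       # ~10K P/E cycles for MLC

spare_pages = capacity_gb * 1_000 * spare_fraction / page_size_mb  # ~10,000 pages
writes_to_one_lba = (1 + spare_pages) * pe_endurance

print(f"{writes_to_one_lba:,.0f} writes to a single logical block")
# ~100 million writes before that one address exhausts its spares,
# assuming no other pages are being written past their limit.
```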

The main advantage is that SSD failure rates would now be more widely distributed. Yes, there would be more early-life failures, especially for SSDs that get hit a lot, but they would no longer fail in unison at some magical write level.

Making SSDs less fragile

While doing all of the above may help a RAID group full of SSDs be less fragile, addressing the inherent fragility of an individual SSD is more problematic.  Nonetheless, some ideas do come to mind:

  • Randomly mix NAND chips from different fabs/vendors; SSDs built from such an intermixture would have a more randomly distributed failure rate, which should increase the standard deviation of their MTBF.
  • Use different NAND technologies in an SSD, say MLC for the bulk of the storage capacity and SLC for the defect-skip capacity (with no wear leveling). Doing this would elongate the lifetime of the average SSD and randomly distribute SSD failures based on write locality of reference, thereby increasing the standard deviation of MTBF.  Of course, this would also have the effect of speeding up heavily written blocks, which would now come out of SLC rather than slower MLC, making these SSDs even faster for the blocks that are written most frequently.
  • Use more random, less deterministic predictive maintenance. SSD predictive maintenance is used to limit the damage from a failing SSD by replacing it before death. By using less deterministic, more randomized algorithms (such as varying how close to wear out we let each SSD get before signaling failure), we would increase the variance of failure (a minimal sketch follows this list).
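
Here is a minimal sketch of that last idea, with an assumed retirement band; each drive gets its own randomized threshold, so identically worn drives are retired at different times.

```python
# Minimal sketch of randomized predictive maintenance: retire each drive at a
# per-drive randomized fraction of rated life, widening the spread of
# effective failure times. All numbers are assumptions.

import random

RATED_PE_CYCLES = 10_000

def should_retire(consumed_pe_cycles, drive_seed):
    """Retire a drive once it passes its own randomized threshold."""
    rng = random.Random(drive_seed)          # stable per-drive randomness
    threshold = rng.uniform(0.80, 0.95)      # assumed retirement band
    return consumed_pe_cycles >= threshold * RATED_PE_CYCLES

# Eight drives with identical wear get retired at different points:
for drive_id in range(8):
    print(drive_id, should_retire(9_000, drive_seed=drive_id))
```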

This post is almost too long now, but there are probably other ideas for increasing the robustness of SSDs and PCIe flash cards that deserve mention someplace. Maybe we can explore these in a subsequent post.

Comments?

[Full disclosure:  I have a number of desktops that use single disk drives (without RAID), backed up to other disk drives.  I own and use a laptop, iPads, and an iPhone that all use SSDs or NAND technology (without RAID). I own neither disk nor SSD storage subsystems.]