The antifragility of disk RAID groups, the fragility of SSDs and what to do about it

Hard Disk by Jeff Kubina (cc) (from Flickr)

[A long post today] I picked up the book Antifragile: Things That Gain from Disorder, by Nassim N. Taleb, and despite trying to put it away at least 3 times now, can’t stop turning back to it.  In his view, fragility is defined by having a negative (bad) response to variation, volatility or randomness in general.  Antifragile is the exact opposite of fragile, in that it has a positive (good) response to more variation, volatility or randomness.  Somewhere between antifragility and fragility is robustness, which has neither a positive nor a negative response (is indifferent) to high volatility, variation or randomness.

Why disks are robust …

To me there are plenty of issues with disks. To name just a few:

  • They are energy hogs,
  • They are slow (at least in comparison to SSDs and flash memory), and
  • They are mechanical contrivances which can be harmed by excess shock/vibration.

But, aside from their capacity benefits, they tend to fail at a normally distributed rate unless there is a particular (special) problem with media, batch, electronics or microprogramming.  I saw plenty of these other types of problems at StorageTek over the years, enough to know that there are many things that can disturb disk failure rate normalization. However, in general, absent some systematic cause of failure, disks fail at a predictable rate with a relatively wide distribution (although, being away from the engineering of storage systems, I have no statistics for the standard deviation of disk failures – it just feels right [Nassim would probably disown me for saying that]).

The other aspect of disk robustness is that as drives degrade over time, they seem to get slower and louder.  The former is predominantly due to defect skipping, an error recovery procedure for bad blocks.  And they get louder as bearings start to wear out, signaling imminent failure ahead.

In defect skipping, when a disk drive detects a bad block, it marks the block as bad and uses a spare block somewhere else on the disk for all subsequent writes. The new block is typically “far” away from the old block, so when reading multiple blocks the drive now has to seek out to the new block and seek back to read the rest, increasing response time in the process.
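To make the cost concrete, here is a toy Python model of defect skipping. It is entirely illustrative (real drives use reserved spare tracks and firmware-specific remap tables); it just counts the extra head movements a remapped block forces on an otherwise sequential read.

```python
SPARE_REGION_START = 1_000_000  # hypothetical spare-block area, far from user data

class Drive:
    def __init__(self):
        self.remap = {}           # bad logical block -> spare block
        self.next_spare = SPARE_REGION_START

    def mark_bad(self, block):
        # Defect skip: all future accesses to this block go to a spare
        self.remap[block] = self.next_spare
        self.next_spare += 1

    def physical(self, block):
        return self.remap.get(block, block)

    def read_run(self, start, count):
        """Count the seeks needed to read `count` consecutive logical blocks."""
        seeks = 0
        pos = None
        for b in range(start, start + count):
            p = self.physical(b)
            if pos is not None and p != pos + 1:
                seeks += 1        # head must move: a remapped block breaks the run
            pos = p
        return seeks

d = Drive()
assert d.read_run(100, 10) == 0   # pristine drive: fully sequential, no seeks
d.mark_bad(104)                   # one defect in the middle of the run
assert d.read_run(100, 10) == 2   # seek out to the spare region and back
```

One remapped block in a ten-block run turns zero seeks into two, which is exactly the response-time degradation described above.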

The other phenomenon behind disk failures is the head crash. These seem to occur completely at random with disks from “mature processes”.

So, I believe disks from mature processes have a normally distributed failure rate with a reasonably wide standard deviation around their MTBF. As such, disk drives should be classified as robust.

… and RAID groups of disk drives are antifragile

But, while disk drives are robust, when you place such devices in a RAID group with others, the RAID groups survive better.  As long as the failure rate of the devices is randomized and there is a wide variance on this failure rate, when a RAID group encounters a single drive failure it is unlikely that a second, or third (RAID DP/6) will also fail while trying to recover from the first.  (Yes, as disk drives get larger the time to recover gets longer thus increasing the probability of multiple drive failures, but absent systematic causes of drive failures, the likelihood of data loss should be rare).
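The argument above can be put into back-of-envelope numbers. The sketch below assumes independent, exponentially distributed drive lifetimes; the MTBF and rebuild-time figures are illustrative, not vendor data.

```python
import math

def p_second_failure(n_remaining, mtbf_hours, rebuild_hours):
    # P(a given drive fails within the rebuild window), exponential lifetime model
    p_one = 1 - math.exp(-rebuild_hours / mtbf_hours)
    # P(at least one of the n remaining drives fails in that window)
    return 1 - (1 - p_one) ** n_remaining

# e.g. 7 surviving drives in an 8-drive group, 1.2M-hour MTBF, 24-hour rebuild
p = p_second_failure(7, 1_200_000, 24)
print(f"P(second failure during rebuild) ~ {p:.2e}")  # about 1.4e-04
```

Doubling or tripling the rebuild window (as larger drives do) scales this probability roughly linearly, which is the caveat noted above; but as long as failures stay independent and dispersed, the result remains rare.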

In a past life, we had multiple disk systems in a location subject to volcanic activity. Somehow, sulfuric fumes from the volcano had found their way into the computer room and were degrading the optical transceivers in our disk drives, causing drive failures.   The subsystem at the time had RAID 6 (dual parity), and over the course of a few weeks we had 20 or more disk drives die in these systems. The customer lost no data during this time, but only because the disk drive failure rate was randomly distributed over time with a wide dispersion.

So by Nassim’s definition, disk RAID groups are antifragile: they operate better with more randomness.

Why SSD and SSD RAID groups are fragile

Toshiba’s New 2.5″ SSD from SSD.Toshiba.com

SSDs have a number of good things going for them. For example:

  • They are blistering fast, at least compared to rotating disks,
  • They are relatively green storage devices, meaning they use less energy than rotating disks, and
  • They are semiconductor devices and as such, are relatively immune to shock and vibration.

Despite all that, given today’s propensity to use wear leveling, RAID groups composed of SSDs can exhibit fragility, because all the SSDs will fail at approximately the same number of Program/Erase (P/E) cycles.

My assumption is that because NAND wear out is essentially an electro-chemical phenomenon, its failure rate, while normally distributed, probably has a very narrow variation.  Now, a given technology’s NAND pages will fail after so many writes; it may be 10K, 30K or 100K (for MLC, eMLC, or SLC respectively), but all the NAND pages from the same technology (manufactured on the same fab line) will likely fail at about the same number of P/E cycles. With wear leveling equalizing the P/E cycles across all pages in an SSD, this means that there is some number of writes that an SSD will endure and then go no farther.  (Again, I have no hard statistics to support this presumption, and Nassim will probabilistically not be pleased with me for saying so.)
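A small Monte Carlo sketch of that presumption: each SSD’s endurance is drawn from an assumed normal distribution (10K P/E cycles with a 2% relative sigma; both numbers are guesses, not data). With wear leveling plus even striping, writes-at-failure equals endurance, so the spread between the first and last failure in a group stays tiny relative to the mean lifetime.

```python
import random
import statistics

random.seed(1)

def group_failure_spread(n=8, mean_pe=10_000, rel_sigma=0.02):
    # Endurance of each SSD in an n-drive group, drawn from a narrow
    # normal distribution (same fab line, same technology)
    endurance = [random.gauss(mean_pe, mean_pe * rel_sigma) for _ in range(n)]
    # Fraction of the mean lifetime separating the first and last failures
    return (max(endurance) - min(endurance)) / statistics.mean(endurance)

spreads = [group_failure_spread() for _ in range(1000)]
print(f"median first-to-last failure spread: {statistics.median(spreads):.1%}")
```

Under these assumptions, the entire group typically fails within a few percent of the same write count, which is the fragility argued for above.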

As such, for a RAID group made up of wear-leveling SSDs, especially with data striping across the group, all the SSDs will probably fail at almost the same time, because they all will have had the same amount of data written to them.  This means that as one SSD in the group reaches wear out, assuming all the others were also fresh at the time the group was created, then all the other devices will be near wear out too.  As a result, when one SSD fails, the others in the RAID group will have a high probability of failure, leading to data loss.

I have written about this before, see my Potential data loss using SSD RAID groups post for more information.

What can we do about the fragility of SSD RAID groups?

A couple of items come to mind that can be done to reduce the fragility of a RAID group of SSDs:

  • Intermix older and newer (fresher) SSDs in a single RAID group, so they don’t all fail at the same time.
  • Don’t use data striping across RAID groups of SSDs; this would allow some devices to be written more than others, introducing some randomness to the SSD failures in the group.
  • Don’t use RAID 1, as this will always cause the same number of writes to be made to each SSD in a pair.
  • Don’t use RAID 5 or other protection methodologies that spread parity writes across the group; using these would be akin to data striping, in that all parity writes would be spread evenly across the group.
  • Consider using different technology SSDs in a RAID group; intermixing MLC, eMLC and SLC drives in a RAID group would have the effect of varying the SSD failure rates.
  • Move away from wear leveling to defect skipping; while doing so will cause some SSDs to fail earlier than they do today, their failure rates will be randomly distributed.
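A tiny sketch of the first idea, under the simplifying assumptions that writes are striped evenly and each drive dies exactly at a fixed endurance (10K P/E cycles here, purely illustrative):

```python
ENDURANCE = 10_000  # assumed P/E cycles before wear out (illustrative)

def cycles_until_first_two_failures(initial_wear):
    """Given each drive's already-consumed P/E cycles, return how many
    further evenly-striped cycles pass before the 1st and 2nd failures."""
    remaining = sorted(ENDURANCE - w for w in initial_wear)
    return remaining[0], remaining[1]

# All-fresh group: the first and second failures arrive together
assert cycles_until_first_two_failures([0, 0, 0, 0]) == (10_000, 10_000)

# Staggered group: the second failure trails the first by 2,500 cycles,
# leaving ample time to rebuild onto a replacement
assert cycles_until_first_two_failures([0, 2_500, 5_000, 7_500]) == (2_500, 5_000)
```

Staggering the ages converts simultaneous wear out into a sequence of widely spaced single failures, which RAID handles well.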

The last one probably deserves some discussion.  There are many reasons for wear leveling, one of which is to speed up writes (by always having a fresh page to write to); another is that NAND blocks cannot be updated in place, they need to be erased before they can be rewritten.  But another major reason is to distribute write activity across all NAND pages to equalize wear out.

In order to speed up writing sans wear leveling, one would need some sort of DRAM buffer to absorb the write activity and later destage it to NAND when available.   The inability to update in place is more problematic, but could potentially be dealt with by using the same DRAM cache to read in the previous information and write back the updates.  Other solutions to this latter problem exist but seem to be more trouble than they are worth.

But for the aspect of wear leveling done to equalize NAND page wearout, I believe there’s a less fragile solution.  If we were to institute some form of defect skipping with a certain amount of spare NAND pages, we could easily extend the life of an SSD, at least until we run out of spare pages.

Today, a considerable amount of spare capacity ships with most SSDs – over 10% in most enterprise class storage, and more with consumer grade. With this much spare capacity, a single NAND logical block could be rewritten an awfully high number of times. For instance, using defect skipping with a 100GB MLC SSD at 10K write endurance, 10% spare pages and a 1MB page size, one single logical block address could be written ~100 million times (assuming no other pages were being written beyond their maximum).
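Checking that arithmetic, using only the figures stated above (100GB MLC, 10K P/E cycles, 10% spare, 1MB pages):

```python
# All figures are the post's stated assumptions
capacity_gb  = 100
spare_frac   = 0.10
page_size_mb = 1
pe_cycles    = 10_000   # MLC write endurance

spare_pages  = capacity_gb * 1024 * spare_frac / page_size_mb  # 10,240 spare pages
total_writes = spare_pages * pe_cycles                          # rewrites one hot LBA can absorb
print(f"{total_writes:,.0f} rewrites of a single hot logical block")
```

That works out to 102,400,000 rewrites, in line with the ~100 million figure above.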

The main advantage is that SSD failure rates would now be more widely distributed. Yes, there would be more early life failures, especially for SSDs that get hit a lot, but they would no longer fail in unison at some magical write level.

Making SSDs less fragile

While doing all the above may help a RAID group full of SSDs be less fragile, addressing the inherent fragility of an SSD itself is more problematic.  Nonetheless, some ideas do come to mind:

  • Randomly mix NAND chips from different fabs/vendors; SSDs that use this intermixture could have a more randomly distributed failure rate, which should increase the standard deviation of MTBF.
  • Use different NAND technologies in an SSD, say MLC for the bulk of the storage capacity and SLC for the defect skip capacity (with no wear leveling). Doing this would elongate the lifetime of the average SSD and randomly distribute SSD failures based on write locality of reference, thereby increasing the standard deviation of MTBF.  Of course, this would also have the effect of speeding up heavily written blocks, which would now come out of SLC rather than slower MLC, making these SSDs even faster for the blocks written most frequently.
  • Use more random, less deterministic predictive maintenance. SSD predictive maintenance is used to limit the damage from a failing SSD by replacing it before death. By using less deterministic, more randomized algorithms (such as varying how close to wear out we let the SSD get before signaling failure), we would increase the variance of failure times.

This post is almost too long now, but there are probably other ideas to increase the robustness of SSDs and PCIe Flash cards that deserve mention someplace. Maybe we can explore these in a subsequent post.

Comments?

[Full disclosure:  I have a number of desktops that use single disk drives (without RAID), which are backed up to other disk drives.  I own and use a laptop, iPads, and an iPhone that all use SSDs or NAND technology (without RAID). I own no disk or SSD storage subsystems.]

 

Intel’s 320 SSD “8MB problem”

Intel SSD 320_001 by TAKA@P.P.R.S (cc) (from Flickr)

Read a recent news item on Intel being able to replicate their 320 SSD “8MB problem” that some customers have been complaining about.

Apparently the problem occurs when power is suddenly removed from the device.  The end result is that the SSD’s capacity is restricted to 8MB from 40GB or more.

I have seen these sorts of problems before.  It probably has something to do with table updating activity associated with SSD wear leveling.

Wear leveling

NAND wear leveling looks very similar to virtual memory addressing: it maps storage block addresses to physical NAND pages. Essentially, something similar to a dynamic memory page table is maintained that shows where the current block is located in physical NAND space, if present.  Typically, there are multiple tables involved: one for spare pages, another mapping current block addresses to NAND page location and offset, one for used pages, etc.  All these tables have to be in some sort of non-volatile storage so they persist after power is removed.
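A minimal sketch of such a mapping table, with all structures hypothetical (real flash translation layers are far more involved, keep the tables in NAND, and must checkpoint or journal updates so they survive power loss):

```python
class FTL:
    """Toy flash translation layer: logical block -> physical NAND page."""
    def __init__(self, n_pages):
        self.l2p = {}                     # logical block address -> physical page
        self.free = list(range(n_pages))  # erased pages available to write
        self.used = set()

    def write(self, lba):
        new_page = self.free.pop(0)       # never update in place: take a fresh page
        old_page = self.l2p.get(lba)
        if old_page is not None:
            self.used.discard(old_page)
            self.free.append(old_page)    # old page is reclaimed (after erase)
        self.l2p[lba] = new_page          # <- this map update must be power-fail safe
        self.used.add(new_page)
        return new_page

ftl = FTL(n_pages=8)
first = ftl.write(lba=0)
second = ftl.write(lba=0)   # the rewrite lands on a different physical page
assert first != second and ftl.l2p[0] == second
```

Note that every single write touches the map, the free list and the used set. If power vanishes between any two of those updates and they were not committed atomically, the drive can come back with an inconsistent picture of its own capacity, which is consistent with the symptom described above.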

Updating such tables and maintaining their integrity is a difficult endeavor.  More than likely, some table update is not occurring in an ACID fashion.

Intel’s fix

Intel has replicated the problem and promises a firmware fix; in my experience this is entirely possible.  Most probably, customer data has not been lost (although this is not a certainty), it’s just not accessible at the moment. And Intel has reminded everyone that, as with any storage device, everyone should be taking periodic backups to other devices; SSDs are no exception.

I am certain that Intel and others are enhancing their verification and validation (V&V) activities to better probe and test the logic behind wear leveling fault tolerance, at least with respect to power loss. Of course, redesigning the table update algorithm to be more robust, reliable, and fault tolerant is the long range solution to these sorts of problems, but that may take longer than just a bug fix.

The curse of complexity

But all this raises a critical question: as we put more and more complexity outboard into the drive, are we inducing more risk?

It’s a perennial problem in the IT industry. Software bugs are highly correlated with complexity and are thereby ubiquitous, difficult to eliminate entirely, and often escape any and all efforts to eradicate them before customer shipment.  However, we can all get better at reducing bugs, i.e., we can make them less frequent, less impactful, and less visible.

What about disks?

All that being said, rotating media is not immune to the complexity problem. Disk drives have different sorts of complexity: disk block addressing is mostly static, and mapping updates occur much less frequently (only for defect skipping), rather than on every data write as with NAND.  As such, problems with power loss impacting table updates are less frequent and less severe with disks.  On the other hand, stiction, vibration, and HDI (head-disk interference) are all very serious problems for rotating media to which SSDs have a natural immunity.

—-
Any new technology brings both advantages and disadvantages with it.  NAND based SSD advantages include high speed, low power, and increased ruggedness, but the disadvantages involve cost and complexity.  We can sometimes trade off cost against complexity, but we cannot eliminate either entirely.

Moreover, while we cannot eliminate the complexity of NAND wear leveling today, we can always test it better.  That’s probably the most significant message coming out of today’s issue.  Any SSD product testing has to take into account the device’s intrinsic complexity and exercise it well, under adverse conditions. Power failure is just one example; I can think of dozens more.

Comments?