The antifragility of disk RAID groups, the fragility of SSDs and what to do about it

Hard Disk by Jeff Kubina (cc) (from Flickr)

[A long post today] I picked up the book Antifragile: Things That Gain from Disorder, by Nassim N. Taleb and, despite trying to put it down at least three times now, I can’t stop turning back to it.  In his view, fragility is defined by having a negative (or bad) response to variation, volatility or randomness in general.  Antifragile is the exact opposite of fragile in that it has a positive (or good) response to more variation, volatility or randomness.  Somewhere between antifragility and fragility is robustness, which has neither a positive nor a negative response (is indifferent) to high volatility, variation or randomness.

Why disks are robust …

To me there are plenty of issues with disks. To name just a few:

  • They are energy hogs,
  • They are slow (at least in comparison to SSDs and flash memory), and
  • They are mechanical contrivances which can be harmed by excess shock/vibration.

But, aside from their capacity benefits, they tend to fail at a normalized failure rate unless there is a particular (special) problem with media, batch, electronics or micro-programming.  I have seen enough of these other types of problems at StorageTek over the years to know that there are many things that can disturb disk failure rate normalization. However, in general, absent some systematic cause of failure, disks fail at a predictable rate with a relatively wide distribution (although, being away from the engineering of storage systems, I have no statistics for the standard deviation of disk failures – it just feels right [Nassim would probably disavow me for saying that]).

The other aspect of disk robustness is that, as they degrade over time, they seem to get slower and louder.  The former is predominantly due to defect skipping, an error recovery procedure for bad blocks.  And they get louder as bearings start to wear out, signaling imminent failure ahead.

With defect skipping, when a disk drive detects a bad block, it marks the block as bad and uses a spare block from somewhere else on the disk for all subsequent writes. The new block is typically “far” away from the old block, so when reading multiple blocks the drive now has to seek out to the new block and seek back to read the rest, increasing response time in the process.
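To make the mechanism concrete, here is a minimal sketch (purely illustrative; the class, block counts and spare-pool layout are my own invention, not any vendor’s firmware) of how a bad-block remap table produces exactly this extra-seek behavior:

```python
# Minimal sketch of defect skipping: a remap table redirects I/O for blocks
# marked bad to spare blocks reserved elsewhere on the disk.
# (Illustrative only -- real drive firmware is far more involved.)

class DefectSkippingDisk:
    def __init__(self, num_blocks, num_spares):
        self.num_blocks = num_blocks
        # Spare blocks live at the end of the disk, "far" from user data.
        self.spares = list(range(num_blocks, num_blocks + num_spares))
        self.remap = {}            # bad logical block -> spare physical block

    def mark_bad(self, block):
        """Retire a bad block by assigning it the next free spare."""
        if not self.spares:
            raise RuntimeError("out of spare blocks -- replace the drive")
        self.remap[block] = self.spares.pop(0)

    def physical_block(self, block):
        """Where a logical block actually lives after any remapping."""
        return self.remap.get(block, block)

disk = DefectSkippingDisk(num_blocks=1_000_000, num_spares=1_000)
disk.mark_bad(42)
# A read of logical blocks 41..43 now requires an extra seek out to the
# spare area and back, which is why remapped drives get slower over time.
print([disk.physical_block(b) for b in (41, 42, 43)])  # [41, 1000000, 43]
```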

The other failure phenomenon disks exhibit is the head crash. These seem to occur completely at random with disks from “mature processes”.

So, I believe disks from mature processes have a normalized failure rate with a reasonably wide standard deviation around their MTBF. As such, disk drives should be classified as robust.

… and RAID groups of disk drives are antifragile

But, while disk drives are merely robust, when you place such devices in a RAID group with others, the RAID group survives failures better.  As long as the failure rate of the devices is randomized and there is a wide variance on this failure rate, when a RAID group encounters a single drive failure it is unlikely that a second, or third (RAID DP/6), will also fail while recovering from the first.  (Yes, as disk drives get larger the time to recover gets longer, thus increasing the probability of multiple drive failures, but absent systematic causes of drive failure, data loss should be rare.)
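As a rough illustration of why randomized, independent drive failures protect a RAID group, here is a back-of-the-envelope sketch (my own, assuming independent, exponentially distributed drive failures and made-up MTBF and rebuild-time figures) of the chance that a second drive dies during a rebuild window:

```python
# Back-of-the-envelope: probability that at least one of the surviving
# drives in a RAID group fails during the rebuild window, assuming
# independent, exponentially distributed failures.  MTBF and rebuild
# times below are illustrative numbers, not vendor specs.
import math

def p_second_failure(surviving_drives, mtbf_hours, rebuild_hours):
    # P(a given drive survives the window) = exp(-t/MTBF)
    p_one_survives = math.exp(-rebuild_hours / mtbf_hours)
    return 1.0 - p_one_survives ** surviving_drives

# 7+1 RAID 5 group, 1.2M-hour MTBF, 24-hour rebuild vs. a 72-hour rebuild
print(p_second_failure(7, 1_200_000, 24))   # ~0.00014
print(p_second_failure(7, 1_200_000, 72))   # ~0.00042 (longer rebuilds hurt)
```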

In a past life we had multiple disk systems in a location subject to volcanic activity. Somehow, sulfurous fumes from the volcano had found their way into the computer room and were degrading the optical transceivers in our disk drives, causing drive failures.   The subsystem at the time had RAID 6 (dual parity) and over the course of a few weeks we had 20 or more disk drives die in these systems. The customer lost no data during this time, but only because the disk drive failure rate was randomly distributed over time with a wide dispersion.

So, by Nassim’s definition, disk RAID groups are antifragile: they operate better with more randomness.

Why SSD and SSD RAID groups are fragile

Toshiba’s New 2.5″ SSD from SSD.Toshiba.com

SSDs have a number of good things going for them. For example:

  • They are blistering fast, at least compared to rotating disks,
  • They are relatively green storage devices, meaning they use less energy than rotating disks, and
  • They are semiconductor devices and, as such, are relatively immune to shock and vibration.

Despite all that, given today’s propensity to use wear leveling, RAID groups composed of SSDs can exhibit fragility because all the SSDs will fail at approximately the same number of Program/Erase (P/E) cycles.

My assumption is that because NAND wear out is essentially an electro-chemical phenomenon, its failure rate, while normally distributed, probably has a very narrow variance.  Given the technology, NAND pages will fail after so many writes; it may be 10K, 30K or 100K (for MLC, eMLC, or SLC) but all the NAND pages from the same technology (manufactured on the same fab line) will likely fail at about the same number of P/E cycles. With wear leveling equalizing the P/E cycles across all pages in an SSD, this means that there is some number of writes that an SSD will endure and then go no farther.  (Again, I have no hard statistics to support this presumption and Nassim will probabilistically not be pleased with me for saying this.)

As such, for a RAID group made up of wear-leveling SSDs, especially with data striping across the group, all the SSDs will probabilistically fail at almost the same time because they all will have had the same amount of data written to them.  This means that as we reach wear out on one SSD in the group, assuming all the others were also fresh at the original creation of the group, all the other devices will be near wear out as well.  As a result, when one SSD fails, the others in the RAID group will have a high probability of failure, leading to data loss.
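Here is a toy Monte Carlo (my own sketch; the lifetime distributions and numbers are invented purely for illustration) contrasting a disk-like wide spread of lifetimes with the narrow, wear-leveled SSD spread described above, asking how often a second device is within 1% of its own end of life when the first one dies:

```python
# Toy Monte Carlo contrasting a wide lifetime spread (disk-like) with a
# narrow one (wear-leveled SSDs under identical write loads).  Question:
# once the first device dies, how often is the next device within 1% of
# mean life of dying too?  All distributions/numbers are illustrative.
import random

def p_near_simultaneous(group_size, mean_life, stddev, window_frac,
                        trials=20_000):
    hits = 0
    for _ in range(trials):
        lives = sorted(max(random.gauss(mean_life, stddev), 1.0)
                       for _ in range(group_size))
        first, second = lives[0], lives[1]
        if (second - first) / mean_life < window_frac:
            hits += 1
    return hits / trials

# Disk-like: wide spread of lifetimes; SSD-like: very narrow spread.
# The narrow-spread case yields a far larger fraction of near-simultaneous
# second failures, which is the fragility argument in a nutshell.
print(p_near_simultaneous(8, mean_life=100.0, stddev=30.0, window_frac=0.01))
print(p_near_simultaneous(8, mean_life=100.0, stddev=1.0, window_frac=0.01))
```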

I have written about this before, see my Potential data loss using SSD RAID groups post for more information.

What can we do about the fragility of SSD RAID groups?

A couple of items come to mind that can be done to reduce the fragility of a RAID group of SSDs:

  • Intermix older and newer (fresher) SSDs in a single RAID group so they don’t all fail at the same time.
  • Don’t use data striping across RAID groups of SSDs; this would allow some devices to be written more than others and, by doing so, introduce some randomness to the SSD failures in the group.
  • Don’t use RAID 1, as this will always cause the same number of writes to be written to pairs of SSDs.
  • Don’t use RAID 5 or other protection methodologies that spread parity writes across the group; using these would be akin to data striping in that all parity writes would be spread evenly across the group.
  • Consider using different technology SSDs in a RAID group; intermixing MLC, eMLC and SLC drives in a RAID group would have the effect of varying the SSD failure rates.
  • Move away from wear leveling to defect skipping; while doing so will cause some SSDs to fail earlier than today, their failure rates will be randomly distributed.

The last one probably deserves some discussion.  There are many reasons for wear leveling, one of which is to speed up writes (by always having a fresh page to write); another is that NAND blocks cannot be updated in place, they need to be erased before they can be rewritten.  But another major reason is to distribute write activity across all NAND pages to equalize wear out.

In order to speed up writing sans wear leveling, one would need some sort of DRAM buffer to absorb the write activity and later destage it to NAND when available.   The inability to update in place is more problematic but could potentially be dealt with by using the same DRAM cache to read in the previous information and write back the updates.  Other solutions to this latter problem exist but seem to be more problematic than they are worth.
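A rough sketch of that buffering idea, under a toy in-memory NAND model (the classes and parameters below are hypothetical, not any controller’s actual design; real NAND also erases whole blocks rather than single pages):

```python
# Sketch: a DRAM write-back buffer absorbs and coalesces writes, then
# destages with read-modify-write since NAND cannot be updated in place.
# Purely illustrative -- simplified to per-page erases for clarity.

class ToyNand:
    def __init__(self, pages, page_size=16):
        self.data = [bytes(page_size) for _ in range(pages)]
    def read(self, addr):            return self.data[addr]
    def erase(self, addr):           self.data[addr] = bytes(len(self.data[addr]))
    def program(self, addr, page):   self.data[addr] = page

class DramWriteBuffer:
    def __init__(self, nand, capacity_pages):
        self.nand = nand
        self.capacity = capacity_pages
        self.pending = {}            # page_addr -> {offset: bytes}

    def write(self, page_addr, offset, data):
        # Fast path: absorb the write in DRAM and acknowledge immediately.
        self.pending.setdefault(page_addr, {})[offset] = data
        if len(self.pending) >= self.capacity:
            self.destage()

    def destage(self):
        for page_addr, updates in self.pending.items():
            page = bytearray(self.nand.read(page_addr))   # read old contents
            for offset, data in updates.items():          # merge the updates
                page[offset:offset + len(data)] = data
            self.nand.erase(page_addr)                    # no update-in-place
            self.nand.program(page_addr, bytes(page))
        self.pending.clear()

nand = ToyNand(pages=16)
buf = DramWriteBuffer(nand, capacity_pages=2)
buf.write(3, 0, b"hello")
buf.write(7, 4, b"world")                    # triggers destage of both pages
print(nand.read(3)[:5], nand.read(7)[4:9])   # b'hello' b'world'
```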

But for the aspect of wear leveling done to equalize NAND page wear out, I believe there’s a less fragile solution.  If we were to institute some form of defect skipping with a certain number of spare NAND pages, we could easily extend the life of an SSD, at least until we run out of spare pages.

Today, a considerable amount of spare capacity ships with most SSDs, over 10% in most enterprise class storage and more in consumer grade. With this much spare capacity, a single NAND logical block could be rewritten an awfully high number of times. For instance, using defect skipping with a 100GB MLC SSD at 10K write endurance, 10% spare pages and a 1MB page size, one single logical block address could be written ~100 million times (assuming no other pages were being written beyond their maximum).
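The arithmetic behind that estimate, spelled out using only the figures assumed in the text:

```python
# Back-of-the-envelope for the defect-skipping example above (all numbers
# assumed from the text: 100GB MLC SSD, 10K P/E endurance, 10% spare, 1MB pages).
spare_capacity_bytes = 0.10 * 100 * 10**9      # 10GB of spare pages
page_size_bytes      = 1 * 10**6               # 1MB pages
pe_cycles_per_page   = 10_000                  # assumed MLC write endurance

spare_pages = spare_capacity_bytes / page_size_bytes       # 10,000 spare pages
writes_to_one_lba = spare_pages * pe_cycles_per_page       # ~100 million writes
print(int(writes_to_one_lba))                               # 100000000
```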

The main advantage is that SSD failure rates would now be more widely distributed. Yes, there would be more early life failures, especially for SSDs that get hit a lot. But they would no longer fail in unison at some magical write level.

Making SSDs less fragile

While doing all the above may help a RAID group full of SSDs be less fragile, addressing the inherent fragility of an SSD itself is more problematic.  Nonetheless, some ideas do come to mind:

  • Randomly mix NAND chips from different fabs/vendors; SSDs that use this intermixture could have a more randomly distributed failure rate, which should increase the standard deviation of MTBF.
  • Use different NAND technologies in an SSD, say MLC for the bulk of the storage capacity and SLC for the defect skip capacity (with no wear leveling). Doing this would lengthen the lifetime of the average SSD and randomly distribute SSD failures based on write locality of reference, thereby increasing the standard deviation of MTBF.  Of course, this would also have the effect of speeding up heavily written blocks, now coming out of SLC rather than slower MLC, making these SSDs even faster for the blocks written most frequently.
  • Use more random, less deterministic predictive maintenance. SSD predictive maintenance is used to limit the damage from a failing SSD by replacing it before death. By using less deterministic, more randomized algorithms (such as varying how close to wear out we let the SSD get before signaling failure), we would increase the variance of failures (see the sketch after this list).
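The sketch below illustrates that last idea: give each SSD its own randomized wear-out threshold for predictive replacement. All endurance numbers and threshold ranges are made up for illustration; this is not any vendor’s actual predictive maintenance scheme.

```python
# Sketch of randomized predictive maintenance: instead of every SSD flagging
# itself for replacement at the same fixed fraction of rated P/E cycles,
# each drive draws its own threshold, which spreads replacement times out.
import random

RATED_PE_CYCLES = 10_000   # assumed MLC endurance

def make_replacement_threshold(low=0.80, high=0.95):
    """Pick a per-drive wear fraction at which to signal predictive failure."""
    return random.uniform(low, high) * RATED_PE_CYCLES

class MonitoredSSD:
    def __init__(self):
        self.avg_pe_cycles = 0
        self.threshold = make_replacement_threshold()

    def record_writes(self, pe_cycles):
        self.avg_pe_cycles += pe_cycles
        return self.avg_pe_cycles >= self.threshold   # True => replace me

# Eight SSDs written identically still ask for replacement at different times.
group = [MonitoredSSD() for _ in range(8)]
print(sorted(int(ssd.threshold) for ssd in group))
```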

This post is almost too long now but there are probably other ideas to increase the robustness of SSDs and PCIe Flash cards that deserve mention someplace. Maybe we can explore these in a subsequent post.

Comments?

[Full disclosure:  I have a number of desktops that use single disk drives (without RAID) that are backed up to other disk drives.  I own and use a laptop, iPads, and an iPhone that all use SSDs or NAND technology (without RAID). I own neither disk nor SSD storage subsystems.]

 

Cache appliances rise from the dead

XcelaSAN picture from DataRam.com website
Sometime back in the late ’80s, a company I once worked with had a product called the tape accelerator, which was nothing more than a RAM cache in front of a tape device to smooth out physical tape access. The tape accelerator was a popular product for its time, until most tape subsystems started incorporating their own cache to do this.

At SNW in Phoenix this week, I saw a couple of vendors touting similar products with a new twist: they had both RAM and SSD cache, and were doing this for disk only. DataRAM’s XcelaSAN was one such product, although apparently there were at least two others on the floor which I didn’t talk with.

XcelaSAN is targeted at midrange disk storage where the storage subsystems have limited amounts of cache. The product is Fibre Channel attached and lists for US$65K per subsystem. Two appliances can be paired together for high availability. Each appliance has eight 4GFC ports, 128GB of DRAM and 360GB of SSD cache.

I talked to them a little about their caching algorithms. They claim to have sequential detect, lookahead and other sophisticated caching capabilities, but the proof is in the pudding. It would be great to put this in front of a currently SPC-benchmarked storage subsystem and see how much it accelerates its SPC-1 or SPC-2 results, if at all.
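For readers unfamiliar with the terms, here is a toy illustration of what “sequential detect” and lookahead generally mean in a block cache. It is emphatically not DataRAM’s algorithm, which they did not describe in detail; the history depth and lookahead count are arbitrary.

```python
# Toy sequential-detect prefetcher: if the last few reads were consecutive
# LBAs, start prefetching the next several blocks before the host asks.
from collections import deque

class SequentialDetector:
    def __init__(self, history=4, lookahead=8):
        self.recent = deque(maxlen=history)
        self.lookahead = lookahead

    def on_read(self, lba):
        self.recent.append(lba)
        if len(self.recent) == self.recent.maxlen and all(
                b - a == 1 for a, b in zip(self.recent, list(self.recent)[1:])):
            # Sequential stream detected: return blocks worth prefetching.
            return list(range(lba + 1, lba + 1 + self.lookahead))
        return []

det = SequentialDetector()
for lba in (100, 101, 102, 103):
    prefetch = det.on_read(lba)
print(prefetch)   # [104, 105, ..., 111] once the stream is recognized
```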

From my view, this is yet another economic foot race. Most new midrange storage subsystems today ship with 8-16GB of DRAM cache and varied, primitive caching algorithms. DataRAM’s appliance has considerably more cache, but at these prices it would need to be amortized over a number of midrange subsystems to be justified.

Enterprise class storage subsystems have a lot of RAM cache already, but most use SSDs as a storage tier and not a cache tier (except for NetApp’s PAM card). Also, we:

  • Didn’t talk much about the reliability of their NAND cache or whether they were using SLC or MLC. These days, with workloads approaching 1:1 read:write ratios, IMHO having some SSD in the system for heavy reads is good, but you need RAM for the heavy write workloads.
  • Didn’t discuss what happens when the power fails, yet another interesting question to ask. Most subsystem caches have battery backup or non-volatile RAM sufficient to get data written to RAM out to some more permanent storage like disk. In these appliances, perhaps they just write it to SSD.
  • Didn’t discuss what happens when the storage subsystem power fails but the appliance stays up. Sooner or later you have to go back to the storage to retrieve or write the data.

In my view, none of these issues are insurmountable, but they take clever code to get around. Knowing how clever their appliance developers are is hard to judge from the outside. Quality is often as much a factor of testing as it is of development (see my Price of Quality post to learn more on this).

Also, caching algorithms are most often tailored to the storage subsystem that surrounds them. But this isn’t always necessary. Take IBM SVC or HDS USP-V, both of which can add a lot of cache in front of other storage subsystems. But those products also offer storage virtualization, which the caching appliances do not provide.

All in all, I feel this is a good direction to take but it’s somewhat time limited until the midrange storage subsystems start becoming more cache intensive/knowledgeable. At that time these products will once again fall into the background. But in the meantime they can have a viable market benefit for the right storage environment.

Sidekick's failure, no backups

Sidekick 2 "Skinit" by grandtlairdjr (cc) (from flickr)
Sidekick 2 "Skinit" by grandtlairdjr (cc) (from flickr)

I believe I have covered this ground before, but apparently it needs reiterating. Cloud storage without backup cannot be considered a viable solution. Replication only works well if you never delete or logically erase data from the primary copy. Once that’s done, the data is soon lost in all replica locations as well.

I am not sure what happened to the Sidekick data, whether somehow a finger check deleted it or some other problem occurred, but from what I can see looking in from the outside there were no backups, no offline copies, no fall-back copies of the data that weren’t part of the central node and its network of replicas. When that’s the case, disaster is sure to ensue.

At the moment the blame game is going around to find out who is responsible, and I hear that some of the data may be being restored. But that’s not the point. Having no backups outside the original storage infrastructure/environment is the problem. Replicas are never enough. Backups have to be elsewhere to count as backups.

Had they had backups, the duration of the outage would have been the length of time it took to retrieve and restore the data, and some customer data written since the last backup would have been lost, but that would have been it. It wouldn’t be the first time backups had to be used and it won’t be the last. But without backups at all, you have a massive customer data loss that cannot be recovered from.

This is unacceptable. It gives IT a bad name, puts a dark cloud over cloud computing and storage, and makes the IT staff of Sidekick/Danger look bad or, worse, incompetent and naive.

All of you cloud providers need to take heed. You can do better. Backup software/services can be used to back up this data and we will all be better served because of it.

The BBC and others now report that most of the Sidekick data will be restored. I am glad that they found a way to recover from their “… data loss in the core database and the back up.” and have “… installed a more resilient back-up process” for their customer data.

Some are saying that the backups just weren’t accessible, but until the whole story comes out I will withhold judgement. I’m just glad to see another potential data loss averted.

The price of quality

At HPTechDay this week we had a tour of the EVA test lab in the south building of HP’s Colorado Springs facility. I was pretty impressed, and I have seen more than my fair share of labs in my day.

Tony Green, HP's EVA Lab Manager
The fact that they have 1200 servers and 500 EVA arrays was pretty impressive, but they also happen to have about 20PB of storage across those 500 arrays. In my day a couple of dozen arrays and a hundred or so servers seemed to be enough to test a storage subsystem.

Nowadays the scale seems to have increased by an order of magnitude. Of course they have sold something like 70,000 EVAs over the years, and some of these 500 arrays happen to be older subsystems used to validate problems and debug issues for the current field population.

Another picture of the EVA lab with older EVAs

They had some old Compaq equipment there, but I seem to have flubbed the picture of that equipment, so this one will have to suffice. It shows both vertically and horizontally oriented drive shelves. I couldn’t tell you which EVAs these were, but as they came earlier in the tour, I figured they were older equipment. It seemed that as you got farther into the tour you moved closer to the current iterations of EVA, like an archaeological dig in reverse: instead of the most current layers/levels coming first, they came last.

I asked Tony how many FC ports he had and he said it was probably easiest to count the switch ports and double them, but something in the thousands seemed reasonable.

FC switch rack with just a small selection of switch equipment

Parts of the lab, deep in its bowels, were off limits both to cameras and to bloggers. But we were talking about some of the remote replication support EVA has and how they test it over distance. Tony said they had to ship their reel of 100 miles of FC up north (probably for some other testing), but they have a surrogate machine which can be programmed to create the proper FC delay to simulate any required distance.

FC delay generator box

The blue box in the adjacent picture seemed to be this magic FC delay inducer box. It had interesting lights on it.

Nigel Poulton of Ruptured Monkeys and Devang Panchigar of StorageNerve Blog were also on the tour taking pictures and video. You can barely make out Devang in the picture next to Nigel. Calvin Zito from HP StorageWorks Blog was also on the tour but is not in any of my pictures.

Nigel and Devang (not pictured) taking videos on EVA lab tour

Throughout our tour of the lab, I can say I saw only one logic analyzer, although I am sure there were plenty more in the off-limits area.

Lonely logic analyzer in EVA lab
During HPTechDay they hit on the topic of storage-server convergence and the use of commodity x86 hardware for future storage systems. From the lack of logic analyzers, I would have to concur with this analysis.

Nonetheless, I saw some hardware workstations, although this was another lonely workstation surrounded by a sea of EVAs.

Hardware workstation in the EVA lab, covered in parts and HW stuff
Believe it or not, I actually saw one stereo microscope but failed to take a picture of it. Yet another indicator of hardware’s descent and my inadequacies as a photographer.

One picture shows an EVA obviously undergoing some error injection test, with drives tagged as removed and being rebuilt or reborn as part of RAID testing.

Drives tagged for removal during EVA test
In my day we would save particularly “squirrelly drives” from the field and use them to verify storage subsystem error handling. I would bet anything these tagged drives had specific error injection points used to validate EVA drive error handling.

I could go on, and I have a couple more decent lab pictures, but you get the gist of the tour.

For some reason I enjoy lab tours. You can tell a lot about an organization by how its labs look and how they are manned, organized and set up. What HP’s EVA lab tells me is that they spare no expense to ensure their product is literally bulletproof, bug proof, and works every time for their customer base. I must say I was pretty impressed.

At the end of the HPTechDay event, Greg Knieriemen of Storage Monkeys and Stephen Foskett of GestaltIT hosted an InfoSmack podcast to be broadcast next Sunday, 10/4/2009. There we talked a little more about commodity hardware versus purpose-built storage subsystem hardware; it was a brief but interesting counterpoint to the discussions earlier in the week and the evidence from our portion of the lab tour.