Potential data loss using SSD RAID groups

Ultrastar SSD400 4 (c) 2011 Hitachi Global Storage Technologies (from their website)
Ultrastar SSD400 4 (c) 2011 Hitachi Global Storage Technologies (from their website)

The problem with SSDs is that they typically all fail at some level of data writes, called the write endurance specification.

As such, if you purchase multiple drives from the same vendor and put them in a RAID group, sometimes this can cause multiple failures because of this.

Say the SSD write endurance is 250TBs (you can only write 250TB to the SSD before write failure), and you populate a RAID 5 group with them in a 3 -data drives + 1-parity drive configuration.  As, it’s RAID 5, parity rotates around to each of the drives sort of equalizing the parity write activity.

Now every write to the RAID group is actually two writes, one for data and one for parity.  Thus, the 250TB of write endurance per SSD, which should result in 1000TB write endurance for the RAID group is reduced to something more like 125TB*4 or 500TB.  Specifically,

  • Each write to a RAID 5 data drive is replicated to the RAID 5 parity drive,
  • As each parity write is written to a different drive, the parity drive of the moment can contain at most 125TB of data writes and 125TB of parity writes before it uses up it’s write endurance spec.
  • So for the 4 drive raid group we can write 500TB of data, evenly spread across the group can no longer be written.

As for RAID 6, it almost looks the same except that you lose more SSD life, as you write parity twice.  E.g. for a 6 data drive + 2 parity drive RAID 6 group, with similar write endurance, you should get 83.3TB of data writes and 166.7TB of parity writes per drive. Which for an 8 drive parity group is 666.4TB of data writes before RAID group write endurance lifetime is used up.

For RAID 1 with 2 SSDs in the group, as each drive mirrors writes to the other drive, you can only get 125TB per drive or 250TB total data writes per RAID group.

But the “real” problem is much worse

If I am writing the last TB to my RAID group and if I have managed to spread the data writes evenly across the RAID group, one drive will go out right away.  Most likely the current parity drive will throw a write error. BUT the real problem occurs during the rebuild.

  • With a 256GB SSD in the RAID 5 group, with 100MB/s read rate, reading the 3 drives in parallel to rebuild the fourth will take ~43 minutes.  However that means all the good SSDs are idle except for rebuild IO.  Most systems limit drive rebuild IO to no more than 1/2 to 1/4 of the drive activity (possibly much less) in the RAID group.  As such, a more realistic rebuild time can be anywhere from 86 to 169 minutes or more.
  • Now because rebuild time takes a long time, data must continue to be written to the RAID group.   But as we are aware, most of the remaining drives in the RAID group are likely to be at the end of their write endurance already.
  • Thus, it’s quite possible that another SSD in the RAID group will fail while the first drive is rebuilt.

Resulting in a catastrophic data loss (2 bad drives in a RAID 5, 3 drives in a RAID 6 group).

RAID 1 groups with SSDs are probably even more prone to this issue. When the first drive fail, the second should follow closely behind.

Yes, but is this probable

First we are talking TBs of data here. The likelihood that a RAID groups worth of drives would all have the same amount of data written to them even within a matter of hours of rebuild time is somewhat unlikely. That being said, the lower the write endurance of the drives, the more equal the SSD write endurance at the creation of the RAID group, and the longer it takes to rebuild failing SSDs, the higher the probability of this type of castastrophic failure.

In any case, the problem is highly likely to occur with RAID 1 groups using similar SSDs as the drives are always written in pairs.

But for RAID 5 or 6, it all depends on how well data striping across the RAID group equalizes data written to the drives.

For hard disks this was a good thing and customers or storage systems all tried to equalize IO activity across drives in a RAID group. So with good (manual or automated) data striping the problem is more likely.

Automated storage tiering using SSDs is not as easy to fathom with respect to write endurance catastrophes.  Here a storage system automatically moves the hottest data (highest IO activity) to SSDs and the coldest data down to hard disks.  In this fashion, they eliminate any manual tuning activity but they also attempt to minimize any skew to the workload across the SSDs. Thus, automated storage tiering, if it works well, should tend to spread the IO workload across all the SSDs in the highest tier, resulting in similar multi-SSD drive failures.

However, with some vendor’s automated storage tiering, the data is actually copied and not moved (that is the data resides both on disk and  SSD).  In this scenario losing an SSD RAID group or two might severely constrain performance, but does not result in data loss.   It’s hard to tell which vendors do which but customers can should be able to find out.

So what’s an SSD user to do

  • Using RAID 4 for SSDs seems to make sense.  The reason we went to RAID 5 and 6 was to avoid hot (parity write) drive(s) but with SSD speeds, having a hot parity drive or two is probably not a problem. (Some debate on this, we may lose some SSD performance by doing this…). Of course the RAID 4 parity drive will die very soon, but paradoxically having a skewed workload within the RAID group will increase SSD data availability.
  • Mixing SSDs age within RAID groups as much as possible. That way a single data load level will not impact multiple drives.
  • Turning off LUN data striping within a SSD RAID group so data IO can be more skewed.
  • Monitoring write endurance levels for your SSDs, so you can proactively replace them long before they will fail
  • Keeping good backups and/or replicas of SSD data.

I learned the other day that most enterprise SSDs provide some sort of write endurance meter that can be seen at least at the drive level.  I would suggest that all storage vendors make this sort of information widely available in their management interfaces.  Sophisticated vendors could use such information to analyze the SSDs being used for a RAID group and suggest which SSDs to use to maximize data availability.

But in any event, for now at least, I would avoid RAID 1 using SSDs.

Comments?

What eMLC and eSLC do for SSD longevity

Enterprise NAND from Micron.com (c) 2010 Micron Technology, Inc.
Enterprise NAND from Micron.com (c) 2010 Micron Technology, Inc.

I talked last week with some folks from Nimbus Data who were discussing their new storage subsystem.  Apparently it uses eMLC (enterprise Multi-Level Cell) NAND SSDs for its storage and has no SLC (Single Level Cell) NAND at all.

Nimbus believes with eMLC they can keep the price/GB down and still supply the reliability required for data center storage applications.  I had never heard of eMLC before but later that week I was scheduled to meet with Texas Memory Systems and Micron Technologies that helped get me up to speed on this new technology.

eMLC/eSLC defined

eMLC and its cousin, eSLC are high durability NAND parts which supply more erase/program cycles than generally available from MLC and SLC respectively.  If today’s NAND technology can supply 10K erase/program cycles for MLC and similarly, 100K erase/program cycles for SLC then, eMLC can supply 30K.  Never heard a quote for eSLC but 300K erase/program cycles before failure might be a good working assumption.

The problem is that NAND wears out, and can only sustain so many erase/program cycles before it fails.  By having more durable parts, one can either take the same technology parts (from MLC to eMLC) to use them longer or move to cheaper parts (from SLC to eMLC) to use them in new applications.

This is what Nimbus Data has done with eMLC.  Most data center class SSD or cache NAND storage these days are based on SLC. But SLC, with only on bit per cell, is very expensive storage.  MLC has two (or three) bits per cell and can easily halve the cost of SLC NAND storage.

Moreover, the consumer market which currently drives NAND manufacturing depends on MLC technology for cameras, video recorders, USB sticks, etc.  As such, MLC volumes are significantly higher than SLC and hence, the cost of manufacturing MLC parts is considerably cheaper.

But the historic problem with MLC NAND is the reduction in durability.  eMLC addresses that problem by lengthening the page programming (tProg) cycle which creates a better, more lasting data write, but slows write performance.

The fact that NAND technology already has ~5X faster random write performance than rotating media (hard disk drives) makes this slightly slower write rate less of an issue. If eMLC took this to only ~2.5X disk writes it still would be significantly faster.  Also, there are a number of architectural techniques that can be used to speed up drive write speeds easily incorporated into any eMLC SSD.

How long will SLC be around?

The industry view is that SLC will go away eventually and be replaced with some form of MLC technology because the consumer market uses MLC and drives NAND manufacturing.  The volumes for SLC technology will just be too low to entice manufacturers to support it, driving the price up and volumes even lower – creating a vicious cycle which kills off SLC technology.  Not sure how much I believe this, but that’s conventional wisdom.

The problem with this prognosis is that by all accounts the next generation MLC will be even less durable than today’s generation (not sure I understand why but as feature geometry shrinks, they don’t hold charge as well).  So if today’s generation (25nm) MLC supports 10K erase/program cycles, most assume the next generation (~18nm) will only support 3K erase/program cycles. If eMLC then can still support 30K or even 10K erase/program cycles that will be a significant differentiator.

—-

Technology marches on.  Something will replace hard disk drives over the next quarter century or so and that something is bound to be based on transistorized logic of some kind, not the magnetized media used in disks today. Given todays technology trends, it’s unlikely that this will continue to be NAND but something else will most certainly crop up – stay tuned.

Anything I missed in this analysis?