Potential data loss using SSD RAID groups

Ultrastar SSD400 4 (c) 2011 Hitachi Global Storage Technologies (from their website)

The problem with SSDs is that they typically all fail at some level of cumulative data writes, called the write endurance specification.

As such, if you purchase multiple drives from the same vendor and put them in a RAID group, this shared endurance limit can lead to multiple, nearly simultaneous failures.

Say the SSD write endurance is 250TB (you can only write 250TB to the SSD before write failure), and you populate a RAID 5 group with them in a 3-data-drive + 1-parity-drive configuration.  As it’s RAID 5, parity rotates around to each of the drives, roughly equalizing the parity write activity.

Now every write to the RAID group is actually two writes, one for data and one for parity.  Thus, the 250TB of write endurance per SSD, which should result in 1000TB of write endurance for the RAID group, is reduced to something more like 125TB*4 or 500TB of data writes.  Specifically:

  • Each write to a RAID 5 data drive is replicated to the RAID 5 parity drive,
  • As parity rotates, each drive ends up absorbing at most 125TB of data writes and 125TB of parity writes before it uses up its write endurance spec.
  • So the 4-drive RAID group can only accept ~500TB of data writes, evenly spread across the group, before nothing more can be written to it.
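
Here is the arithmetic above as a quick back-of-the-envelope sketch in Python. The function name and the model are purely illustrative: it assumes writes (data plus parity) are spread evenly across the group and ignores any write amplification inside the SSDs themselves.

    # Toy model of RAID group write endurance: every data write also
    # generates `parity_copies` parity (or mirror) writes somewhere in
    # the group, and all writes are spread evenly across the drives.
    def raid_group_data_endurance(per_drive_endurance_tb, num_drives, parity_copies):
        total_endurance_tb = per_drive_endurance_tb * num_drives
        return total_endurance_tb / (1 + parity_copies)

    # RAID 5, 3 data + 1 parity drives, 250TB endurance per SSD
    print(raid_group_data_endurance(250, 4, 1))   # -> 500.0 TB of data writes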

As for RAID 6, it looks almost the same except that you lose more SSD life, as you write parity twice.  E.g., for a 6 data drive + 2 parity drive RAID 6 group with similar write endurance, you should get roughly 83.3TB of data writes and 166.7TB of parity writes per drive, which for the 8-drive group is ~666.7TB of data writes before the RAID group's write endurance lifetime is used up.

For RAID 1 with 2 SSDs in the group, as each drive mirrors writes to the other drive, you can only get 125TB of data writes per drive, or 250TB of total data writes per RAID group.
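
Running the same toy model (the sketch function above) against the RAID 6 and RAID 1 examples reproduces these figures:

    # RAID 6, 6 data + 2 parity drives: each data write generates two parity writes
    print(raid_group_data_endurance(250, 8, 2))   # -> ~666.7 TB of data writes

    # RAID 1, 2-drive mirror: each data write generates one mirror write
    print(raid_group_data_endurance(250, 2, 1))   # -> 250.0 TB of data writes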

But the “real” problem is much worse

If I am writing the last TB to my RAID group and have managed to spread the data writes evenly across the RAID group, one drive will fail right away; most likely the current parity drive will throw a write error.  BUT the real problem occurs during the rebuild.

  • With a 256GB SSD in the RAID 5 group and a 100MB/s read rate, reading the 3 surviving drives in parallel to rebuild the fourth will take ~43 minutes.  However, that assumes all the good SSDs are idle except for rebuild IO.  Most systems limit drive rebuild IO to no more than 1/2 to 1/4 of the drive activity (possibly much less) in the RAID group.  As such, a more realistic rebuild time is anywhere from ~85 to ~170 minutes or more (see the sketch below).
  • Because the rebuild takes a long time, data must continue to be written to the RAID group during it.  But as noted above, the remaining drives in the RAID group are likely to be near the end of their write endurance already.
  • Thus, it’s quite possible that another SSD in the RAID group will fail while the first drive is being rebuilt.

The result is catastrophic data loss (2 bad drives in a RAID 5 group, 3 bad drives in a RAID 6 group).
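
For what it's worth, here is the rebuild-time arithmetic from the first bullet as a small illustrative sketch.  It assumes the rebuild is gated purely by the read rate of the surviving drives and a simple throttling factor; the function name and numbers are just for illustration.

    # Rough rebuild-time estimate: time to read one drive's worth of data,
    # with rebuild IO limited to `rebuild_share` of the drive's throughput.
    def rebuild_minutes(capacity_gb, read_mb_per_s, rebuild_share=1.0):
        seconds = (capacity_gb * 1000) / (read_mb_per_s * rebuild_share)
        return seconds / 60

    print(round(rebuild_minutes(256, 100)))        # ~43 minutes, unthrottled
    print(round(rebuild_minutes(256, 100, 0.5)))   # ~85 minutes at 1/2 throughput
    print(round(rebuild_minutes(256, 100, 0.25)))  # ~171 minutes at 1/4 throughput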

RAID 1 groups with SSDs are probably even more prone to this issue. When the first drive fails, the second should follow closely behind.

Yes, but is this probable

First, we are talking TBs of data here. It is somewhat unlikely that a RAID group’s worth of drives would all have had the same amount of data written to them, even to within a matter of hours of rebuild time. That being said, the lower the write endurance of the drives, the more equal the SSD write endurance at the creation of the RAID group, and the longer it takes to rebuild failing SSDs, the higher the probability of this type of catastrophic failure.

In any case, the problem is highly likely to occur with RAID 1 groups using similar SSDs as the drives are always written in pairs.

But for RAID 5 or 6, it all depends on how well data striping across the RAID group equalizes data written to the drives.

For hard disks, equalizing IO activity across the drives in a RAID group was a good thing, and customers or storage systems all tried to do it. But with SSDs, the better the (manual or automated) data striping, the more likely this problem becomes.

Automated storage tiering using SSDs is harder to reason about with respect to write endurance catastrophes.  Here a storage system automatically moves the hottest data (highest IO activity) to SSDs and the coldest data down to hard disks.  In this fashion it eliminates any manual tuning activity, but it also attempts to minimize any skew in the workload across the SSDs. Thus, automated storage tiering, if it works well, should tend to spread the IO workload across all the SSDs in the highest tier, setting up similar multi-SSD drive failures.

However, with some vendors’ automated storage tiering, the data is actually copied and not moved (that is, the data resides both on disk and SSD).  In this scenario, losing an SSD RAID group or two might severely constrain performance, but does not result in data loss.   It’s hard to tell which vendors do which, but customers should be able to find out.

So what’s an SSD user to do

  • Using RAID 4 for SSDs seems to make sense.  The reason we went to RAID 5 and 6 was to avoid hot (parity write) drive(s), but with SSD speeds, having a hot parity drive or two is probably not a problem. (Some debate on this; we may lose some SSD performance by doing this…) Of course the RAID 4 parity drive will die much sooner than the data drives, but paradoxically, having a skewed workload within the RAID group will increase SSD data availability.
  • Mixing SSD ages within RAID groups as much as possible. That way a single data load level will not impact multiple drives.
  • Turning off LUN data striping within an SSD RAID group so data IO can be more skewed.
  • Monitoring write endurance levels for your SSDs, so you can proactively replace them long before they fail.
  • Keeping good backups and/or replicas of SSD data.

I learned the other day that most enterprise SSDs provide some sort of write endurance meter that can be seen at least at the drive level.  I would suggest that all storage vendors make this sort of information widely available in their management interfaces.  Sophisticated vendors could use such information to analyze the SSDs being used for a RAID group and suggest which SSDs to use to maximize data availability.
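
As a rough illustration of the kind of monitoring I have in mind, a script along these lines could poll SMART wear attributes via smartmontools and flag drives nearing end of life.  The attribute names and device list below are examples only; they vary by SSD vendor, so treat this as a sketch rather than a recipe.

    import subprocess

    # Vendor-specific wear attributes; these names are examples and differ by SSD maker.
    WEAR_ATTRIBUTES = ("Media_Wearout_Indicator", "Wear_Leveling_Count",
                       "Percent_Lifetime_Remain")

    def wear_remaining(device):
        """Return the normalized wear value (roughly % life left) reported by
        smartctl, or None if no recognizable wear attribute is found."""
        out = subprocess.run(["smartctl", "-A", device],
                             capture_output=True, text=True).stdout
        for line in out.splitlines():
            fields = line.split()
            if len(fields) > 3 and fields[1] in WEAR_ATTRIBUTES:
                return int(fields[3])     # normalized current value
        return None

    for dev in ("/dev/sda", "/dev/sdb"):  # hypothetical device list
        life = wear_remaining(dev)
        if life is not None and life < 10:
            print(f"{dev}: ~{life}% life left - schedule replacement")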

But in any event, for now at least, I would avoid RAID 1 using SSDs.

Comments?

10 thoughts on “Potential data loss using SSD RAID groups”

  1. You nailed it when you said: the lower the write endurance and the longer it takes to rebuild, the higher the probability of this type of failure. However, when it comes to enterprise SSDs, you may be underestimating the write endurance…

    Take the Hitachi Ultrastar SSD400S pictured as an example: according to their website the 400GB drive has a write endurance of 35 PB.
    (http://www.hitachigst.com/solid-state-drives/ultrastar-ssd400s)

    Texas Memory Systems' RamSan-630 with 10TB has a write endurance of 2,750 PB.
    (http://www.ramsan.com/products/rackmount-flash-storage/ramsan-630)

    Thanks!

    1. Phillip, Thanks for your comment. Yes, the HGST drive has a write endurance of 35PB and the TMS RamSan-630 10TB system has a write endurance of 2.8EB, but the issues still exist, just at higher levels. The HGST has smaller drive capacities with correspondingly lesser write endurance levels, and many prior generation enterprise class drives (shipped and installed today) have considerably less write endurance than these leading edge devices. It's only recently that I have even seen write endurance specifications published externally for SSDs. Nonetheless, the problem exists because SSDs have a narrower reliability (or MTBF) variance than disk drives. Paradoxically, for RAID groups to work well you want storage that has high variability in MTBF. As such, any storage with low MTBF variability (such as SSDs) should only be included in RAID groups with some due consideration of these issues. Ray

  2. Thank you for the interesting article. I think one aspect that needs to be considered here is what it means for an SSD to "fail". Write endurance EOL in SSDs is quite different from an HDD failure, which typically is catastrophic, resulting in data loss without redundancy. On the flip side, SSDs are often turned into read-only bricks. The data is still there, but it can't accept any more writes without risking data loss. There are other caveats like data retention after EOL, but in general this all can be managed carefully by the SSD controller. In that case, maybe your RAID 5 or 6 data cache does start failing >1 within a time frame, but the data isn't lost. Of course, application / system handling of the SSDs going ROM like the write caching D mentioned above would also be important! 😉

    It would seem to me that some online or even offline spares would be the best remedy for any early failures. My only concern for RAID 4 would be a question of performance differences. Any feedback / experience on how performance compares on a RAID 4 vs. RAID 5 with a set of maybe 8-16 drives?

    Other than handling early life failures or wearout, SSD write endurance certainly needs to be taken into account vs. the required useful life of the storage array. Obviously provisions like RAID need to be made to prevent data loss, but I think handling the write endurance of SSDs is a new and different problem set from that addressed by RAID.

  3. The amount of writes an SSD can accept before it has exhausted all of its internal overprovisioning and spare blocks is an estimation and not an absolute. For MOST (but not all) SSDs, the actual lifespan is dependent upon wear-out at the rank, chip, or segment within the chip (when the target NAND cells can no longer retain data, the address range they support is remapped into spare NAND).

    As a result, the wearing of drives is not an absolute – in fact, given manufacturing variability of NAND chips, it is highly unlikely that a given set of drives will run out of spare tracks simultaneously, or necessarily even within the same 2 hours. (Note that hard disk drives also remap bad blocks, and although the predicted error rate for each drive is the same, it is rare that multiple disk drives cease accepting writes at the same time either).

    The “usage gauge” that many enterprise SSDs support (including all the SSDs that EMC uses) tracks average program/erase cycles per cell AND the amount of spare capacity that remains available for remapping of bad blocks. VMAX (as an example) tracks this usage gauge and will initiate the hot sparing process and a request for SSD replacement long before the spare capacity within the SSD is even close to fully utilized. It is also standard service practice to review the usage gauge on other SSDs in the array on every SSD replacement and to initiate replacement requests on any other “almost used up” SSDs.

    This usage gauge is also available on many consumer-grade SSDs, and there is a standard SMART query that reports the “used” level of the SSD. There are many freeware utilities that will report this usage statistic for SSDs under Windows. Given that the drives will indeed ultimately wear out all of the spare capacity, users should monitor the usage of their SSDs and take action to replace them before the remaining capacity falls below 5 or 10%, dependent upon the wear rate observed on the drives.

    As a reference point, on my high performance home workhorse that uses an Intel X25M 160GB drive for boot, log, temp, Program Files and swap space, I have observed that the remaining spare capacity has fallen from 100% to 99% in 6 months – if my usage remains linear, I expect the SSD to easily outlast the system’s useful life.

    1. Storageanarchy

      Thanks for the insightful comment. I think the VMAX retiring SSDs before wear-out is another great solution to this problem. And having service people watch out for near end-of-life SSDs is also noteworthy.

      Yes, not all NAND chips from the same vendor will fail at the same exact P/E cycle, but now it's a question of variability. I assume that the variance in SSD write endurance failures is much narrower than for disk drive failures (although I know of no published information to support this). The fact that the wear-out is indeed a mechanical process inside the chip's flash cells is probably a good sign, as this would indicate that the variance may not be that narrow. But the counter trend is that smaller flash cell geometries mean finer mechanical tolerances, and the variance of P/E cycle failure would seem to get narrower, not wider (again an assumption, with little empirical data to support this).

      It seems to me that somebody (NAND manufacturers, SSD suppliers, or storage subsystem vendors) should be interested in sponsoring some research to statistically describe write endurance and put this discussion to rest.

      The usage gauge is something that I have only learned about recently. But I have a MacBook Air with 128GB of SSD, and I don't see any system-level indicator that shows wearout.

  4. Dimitris,

    Thanks for the comment. There have been many comments questioning what the variance in SSD write endurance looks like. At this point it's all conjecture. But my whole discussion is on the premise that whatever it is, it is much narrower than for disk.

    I wasn't aware that RAID-DP was sort of a dual RAID-4, and as such would seem to handle SSDs just fine.

    I am unfamiliar with anyone that uses SSDs as only a write cache. It would seem to me to be problematic at best but it may depend on the I/O load.

    Ray

  5. The others can say what they like but this is exactly what happened to us last week. 1 drive lost, 5 more lost during the rebuild cycle. None of them can be recognised even as individual drives on a new controller. Not just a theoretical risk; it happened to us.

    1. John, Thanks for your comment. Yes, it can happen; the question you should be asking is why no one saw this coming sooner and eliminated the problem before it occurred. Hope you had backups… Ray
