The problem with SSDs is that they typically all fail at some level of data writes, called the write endurance specification.
As such, if you purchase multiple drives from the same vendor and put them in a RAID group, sometimes this can cause multiple failures because of this.
Say the SSD write endurance is 250TBs (you can only write 250TB to the SSD before write failure), and you populate a RAID 5 group with them in a 3 -data drives + 1-parity drive configuration. As, it’s RAID 5, parity rotates around to each of the drives sort of equalizing the parity write activity.
Now every write to the RAID group is actually two writes, one for data and one for parity. Thus, the 250TB of write endurance per SSD, which should result in 1000TB write endurance for the RAID group is reduced to something more like 125TB*4 or 500TB. Specifically,
- Each write to a RAID 5 data drive is replicated to the RAID 5 parity drive,
- As each parity write is written to a different drive, the parity drive of the moment can contain at most 125TB of data writes and 125TB of parity writes before it uses up it’s write endurance spec.
- So for the 4 drive raid group we can write 500TB of data, evenly spread across the group can no longer be written.
As for RAID 6, it almost looks the same except that you lose more SSD life, as you write parity twice. E.g. for a 6 data drive + 2 parity drive RAID 6 group, with similar write endurance, you should get 83.3TB of data writes and 166.7TB of parity writes per drive. Which for an 8 drive parity group is 666.4TB of data writes before RAID group write endurance lifetime is used up.
For RAID 1 with 2 SSDs in the group, as each drive mirrors writes to the other drive, you can only get 125TB per drive or 250TB total data writes per RAID group.
But the “real” problem is much worse
If I am writing the last TB to my RAID group and if I have managed to spread the data writes evenly across the RAID group, one drive will go out right away. Most likely the current parity drive will throw a write error. BUT the real problem occurs during the rebuild.
- With a 256GB SSD in the RAID 5 group, with 100MB/s read rate, reading the 3 drives in parallel to rebuild the fourth will take ~43 minutes. However that means all the good SSDs are idle except for rebuild IO. Most systems limit drive rebuild IO to no more than 1/2 to 1/4 of the drive activity (possibly much less) in the RAID group. As such, a more realistic rebuild time can be anywhere from 86 to 169 minutes or more.
- Now because rebuild time takes a long time, data must continue to be written to the RAID group. But as we are aware, most of the remaining drives in the RAID group are likely to be at the end of their write endurance already.
- Thus, it’s quite possible that another SSD in the RAID group will fail while the first drive is rebuilt.
Resulting in a catastrophic data loss (2 bad drives in a RAID 5, 3 drives in a RAID 6 group).
RAID 1 groups with SSDs are probably even more prone to this issue. When the first drive fail, the second should follow closely behind.
Yes, but is this probable
First we are talking TBs of data here. The likelihood that a RAID groups worth of drives would all have the same amount of data written to them even within a matter of hours of rebuild time is somewhat unlikely. That being said, the lower the write endurance of the drives, the more equal the SSD write endurance at the creation of the RAID group, and the longer it takes to rebuild failing SSDs, the higher the probability of this type of castastrophic failure.
In any case, the problem is highly likely to occur with RAID 1 groups using similar SSDs as the drives are always written in pairs.
But for RAID 5 or 6, it all depends on how well data striping across the RAID group equalizes data written to the drives.
For hard disks this was a good thing and customers or storage systems all tried to equalize IO activity across drives in a RAID group. So with good (manual or automated) data striping the problem is more likely.
Automated storage tiering using SSDs is not as easy to fathom with respect to write endurance catastrophes. Here a storage system automatically moves the hottest data (highest IO activity) to SSDs and the coldest data down to hard disks. In this fashion, they eliminate any manual tuning activity but they also attempt to minimize any skew to the workload across the SSDs. Thus, automated storage tiering, if it works well, should tend to spread the IO workload across all the SSDs in the highest tier, resulting in similar multi-SSD drive failures.
However, with some vendor’s automated storage tiering, the data is actually copied and not moved (that is the data resides both on disk and SSD). In this scenario losing an SSD RAID group or two might severely constrain performance, but does not result in data loss. It’s hard to tell which vendors do which but customers can should be able to find out.
So what’s an SSD user to do
- Using RAID 4 for SSDs seems to make sense. The reason we went to RAID 5 and 6 was to avoid hot (parity write) drive(s) but with SSD speeds, having a hot parity drive or two is probably not a problem. (Some debate on this, we may lose some SSD performance by doing this…). Of course the RAID 4 parity drive will die very soon, but paradoxically having a skewed workload within the RAID group will increase SSD data availability.
- Mixing SSDs age within RAID groups as much as possible. That way a single data load level will not impact multiple drives.
- Turning off LUN data striping within a SSD RAID group so data IO can be more skewed.
- Monitoring write endurance levels for your SSDs, so you can proactively replace them long before they will fail
- Keeping good backups and/or replicas of SSD data.
I learned the other day that most enterprise SSDs provide some sort of write endurance meter that can be seen at least at the drive level. I would suggest that all storage vendors make this sort of information widely available in their management interfaces. Sophisticated vendors could use such information to analyze the SSDs being used for a RAID group and suggest which SSDs to use to maximize data availability.
But in any event, for now at least, I would avoid RAID 1 using SSDs.