
Back at SFD10, a couple of weeks ago now, when we were visiting with Nimble Storage, they mentioned that their latest all-flash storage array was going to support triple-parity RAID.
And last week at a NetApp-SolidFire analyst event, someone mentioned that the new ONTAP 9 supports triple-parity RAID-TEC™ for larger SSDs. Also heard at the meeting: a 15.3TB SSD would take on the order of 12 hours to rebuild.
Need for better protection
When Nimble discussed the need for triple-parity RAID, they mentioned the Google report I talked about recently (see my Surprises from 4 years of SSD experience at Google post). In that post, the main surprise was the number of read errors Google had seen from the SSDs deployed throughout their data centers.
I think the need for triple-parity RAID, and larger (15TB+) SSDs, will both become more common over time. There's no reason to think the SSD vendors will stop at 15TB. And if it takes 12 hours to rebuild a 15TB SSD, I think it's probably something like ~30 hours to rebuild a 30TB one, which is just a generation or two away.
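As a rough check on that guess, here's a back-of-the-envelope sketch. It simply assumes rebuild time scales linearly with capacity, using the 15.3TB/12-hour figure quoted above; real systems add overhead for rebuild throttling and production I/O, so treat it as a floor, not a prediction.

```python
# Back-of-envelope rebuild-time scaling. Assumes rebuild time grows
# roughly linearly with SSD capacity; real arrays add overhead for
# rebuild throttling and competing production I/O.

quoted_capacity_tb = 15.3      # SSD size quoted above
quoted_rebuild_hr = 12.0       # rebuild time quoted above

rate_tb_per_hr = quoted_capacity_tb / quoted_rebuild_hr   # ~1.3 TB/hr (~350 MB/s)

for capacity_tb in (15.3, 30.6):
    hours = capacity_tb / rate_tb_per_hr
    print(f"{capacity_tb:4.1f}TB SSD -> ~{hours:4.1f} hours (linear estimate)")
```

A straight linear scale-up already puts a ~30TB rebuild at about a day; add any throttling or contention from production I/O and a ~30-hour rebuild doesn't look far-fetched.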
A read error on one SSD in a RAID group during an SSD rebuild can be masked by having dual parity. A read error on two SSDs can only be masked by having triple parity RAID.
Likelihood of a 2nd error is rising
What's the likelihood of two read errors in a RAID group during a 12-hour rebuild? Probably not that high, but if SSDs in general throw more read errors, then there are at least going to be more partial rebuilds. And with 15TB, soon 30TB, SSDs, there aren't going to be a whole lot of SSDs in your typical storage system anymore. So having one large parity group, with triple parity, might make a lot of sense.
The mathematics of actual failure rates are beyond me, but with a higher frequency of read errors and longer rebuild times (due to larger SSDs), triple parity makes sense. If not today with 15TB SSDs, then the next generation of 3D NAND SSDs may make it mandatory.
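For anyone who wants rough numbers, here's a minimal sketch of the usual back-of-envelope calculation. The group width, drive size and uncorrectable bit error rate below are illustrative assumptions, not vendor figures, and it treats every bit read during the rebuild as an independent trial, which real drives are not.

```python
import math

# Rough odds of hitting uncorrectable read errors while rebuilding one
# failed SSD in a RAID group. All numbers below are illustrative
# assumptions, not vendor specs.

ssd_capacity_tb = 15.3    # assumed SSD size
group_width     = 10      # assumed number of SSDs in the RAID group
uber            = 1e-16   # assumed uncorrectable bit error rate (1 bit in 10^16)

bits_per_tb = 8e12
# A rebuild reads every surviving member of the group end to end.
bits_read = (group_width - 1) * ssd_capacity_tb * bits_per_tb

mean_errors = bits_read * uber          # expected errors during the rebuild
# Poisson approximation for one-or-more and two-or-more errors
p_one_plus = 1 - math.exp(-mean_errors)
p_two_plus = 1 - math.exp(-mean_errors) * (1 + mean_errors)

print(f"expected read errors during rebuild: {mean_errors:.3f}")
print(f"P(at least one): {p_one_plus:.2%}   P(at least two): {p_two_plus:.3%}")
```

Under those assumptions a single read error during the rebuild is roughly a 1-in-10 event, which dual parity covers; it's the roughly 1-in-200 two-error case that needs a third parity, and both numbers climb as drives and parity groups get bigger.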
Moreover, I am aware of recent research (see my Better erasure coding … post) that indicates rebuild activity may become even more prevalent in the future, being used to work around slow SSD accesses as well as data errors. So having more parity would make sense if you were rebuilding blocks more often….
Triple parity seems here to stay…
Comments?
@RayLucchesi @NimbleStorage With 16TB SSDs now and 32TB coming soon, I vote YES! 🙂
“A read error on one SSD in a RAID group during an SSD rebuild can be masked by having dual parity. A read error on two SSDs can only be masked by having triple parity RAID.”
What is MTTDL with RAID6? That's sort of an aside. Triple parity is for broken RAID implementations, i.e. typical legacy RAID6. The problems are a BER of 1 in 10^16 on Enterprise SATA and, on large RAID arrays, the percentage chance of a bad-block "hit" due to BER on rebuild, which is now a concern. But look at what SSD/flash is now, a BER of 1 in 10^17, so what are the odds of a BER hit on a RAID6 disk failure? Almost non-existent, right? I'll come back with the math if you want.

Robin Harris penned that RAID6 stops working in 2019: http://bit.ly/1XHToX1. Funny thing was, in his run-up to Infinidat, before we knew much about the RAID levels in play there, I had been peeking at patents and knew it was a RAID6 "implementation" (but much more than that, eh?). He was going on about it having to be triple parity and I dropped that I'd bet it was RAID6. He then pens his ZDNet article pointing out Infinibox is a RAID6 implementation, but didn't go back and clean up his StorageMojo article.

But to answer your question… is triple parity necessary? In certain situations, yes: for broken/legacy RAID implementations that aren't all-flash arrays. And even for vendors that aren't doing specialized RAID levels like XtremIO or Pure, a traditional RAID6 in an AFA should be no big deal, although after a back and forth with a Nimble resource who seems to think it is, I gave up on him.

Finally, one last anecdote. At a training session a former Sun engineer mentioned why they introduced triple parity in ZFS: large HPC installs. He mentioned they actually lost 2 or 3 RAID6 arrays over the course of several years. Not good, obviously. Rather historical though. If you have disks (go back 7 years) with a BER of 1 in 10^15 and several thousand disks on the floor doing RAID6, it's only a matter of time before one of them blows out on rebuild.
Rob, thanks for your comment. I am aware of RAID6 data losses with disk. It happens, especially if the storage has a large enough install base and is widely available. Sulfuric emissions from a nearby volcano can have bad impacts on optical components. That being said, RAID6 and triple-parity RAID (triple-level erasure coding) are all technological solutions to device errors. The latest Google report on flash drive reliability in their data centers (see my earlier post) seems to imply that read errors are more of a problem on SSDs than on disks. If I take that as a given, there is a real need for triple-level RAID/erasure coding as SSDs get bigger and rebuild times lengthen. Can you speed up rebuild? Yes, with wide striping and other techniques, and that all will help. But in the end, you can only write 15TB so fast and it will take time. So I think the technology is here to stay as long as read error rates persist and SSDs keep getting larger.
Ray
I saw Google's report before. Spelling out why triple parity "might" be necessary for rather lame double-parity (RAID6) implementations going forward is what I was driving at. Trevor Pott does a good roll-up of BER: http://bit.ly/28J6VcA.

Enterprise SSD has a BER of 1 in 10^17, meaning a bad bit every 12.5 PB or so. Lose a drive in a traditional 4+1 SSD RAID5 with 10TB SSDs and you have a 1 in 327,680 chance of hitting a bad block on rebuild. I'm not running through all the combos on these.

The real issue is the RAID6 implementation that is 30+2 (or similarly large), using Enterprise SATA: a legacy RAID6, if you will, that can go big (unlike others that *only* offer 4+2 or 6+2 as choices). BER is 1 in 10^16, a bad bit every 1.25 PB. If those were 10TB drives you have a 1 in 4,228 chance of hitting a bad block on rebuild (rebuilding from 31 10TB drives), but then you have the SAME odds of hitting a second bad block on rebuild, a tiny, tiny bit of a concern, hence they are covering themselves with triple parity. Traditional 6+2 SATA rebuilds, even with large drives, have a lot less exposed on rebuild: hitting a single bad block/bit on rebuild is 1 in 21,854 or so, and hitting a second bad block is not worth worrying about.

So I only agree triple parity is necessary for folks building very large traditional RAID6 with Enterprise SATA, and even then the odds are very low. They are no doubt continuing this forward as SATA goes to 20TB and 30TB (the 1 in 4,228 for 10TB becomes 1 in 2,114 at 20TB). Everyone else, either using all SSDs or moving beyond legacy RAID6, isn't at risk. That is what I was driving at. I guess a handful of vendors apply?

By the way, and back to the issue of "less reliable" SSDs: almost all these vendors nowadays are proactively kicking disks to the curb on more than a single error. They will swap in a fresh disk, take the suspect one offline, beat on it without mercy and flag it as bad if it rings errors. Google finding "bad" SSDs in a sense doesn't apply; those would most likely be kicked to the curb?
Rob, thanks for your reply. In reading the Google paper, it says they are seeing a non-transparent error on their SSDs “…between 2-6 out of every 1000 drive days…”. Now, Google may be transferring lots of data during those drive days. That being the case, this means one non-transparent SSD error every 167 to 500 drive days. Non-transparent errors include final read error, uncorrectable error, final write error, meta error, timeout error, and response errors, with final read error and uncorrectable error being the dominant errors in their experience. This is extremely disturbing if true. According to Google's experience, a customer with 167 SSDs should see, on average, one non-transparent error every day, if they are transferring data at the rates Google is.
There are plenty of AFA customers that have 170-500 drives, which means that if they transfer data at Google's rates, they will be seeing one non-transparent error every day. Now if this occurs during a rebuild in a RAID 5 group, there's going to be data loss. Is RAID 6 a proper answer to this? Perhaps, but if rebuild times start exceeding one day, there's a high likelihood of another non-transparent error during the rebuild. If both of these non-transparent errors occur inside the RAID group being rebuilt, there's going to be data loss. A big IF, I know, but with the frequency of non-transparent errors shown in Google's experience there are going to be a lot of rebuilds going on, and with 10-drive RAID groups the chances of having two occur during the same 24-hour period in the same RAID group are definitely higher than zero.
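To put a very rough number on "higher than zero", here's a quick sketch. It assumes a hypothetical 170-drive fleet in 10-drive RAID groups, one non-transparent error per day fleet-wide, and errors that are independent and spread evenly across the drives; all of those assumptions are generous to the drives.

```python
import math

# Very rough sketch: if a fleet sees about one non-transparent SSD error
# per day, how often do two errors land in the same RAID group within
# the same 24 hours? Assumes independent, evenly spread errors.

fleet_drives         = 170    # assumed fleet size
group_width          = 10     # assumed RAID group size
fleet_errors_per_day = 1.0    # assumed fleet-wide error rate

groups = fleet_drives // group_width
errors_per_group_day = fleet_errors_per_day * group_width / fleet_drives

# Poisson approximation for two or more errors in one group in one day
p_group = 1 - math.exp(-errors_per_group_day) * (1 + errors_per_group_day)
p_fleet = 1 - (1 - p_group) ** groups

print(f"P(two errors in one group, per day):  {p_group:.3%}")
print(f"P(it happens somewhere in the fleet): {p_fleet:.2%} per day")
```

Even under those generous assumptions it works out to a couple of percent per day somewhere in the fleet, call it roughly once a month, and any such day that coincides with a rebuild in that group is a bad day for dual parity.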
You gotta read the fine print in the Google FAST'16 paper. Table 2 is no help to me. It shows the fraction of drives with different types of errors and the fraction of drive days with different types of errors, but if I add up the numbers in the table, they don't get me to the rate they quote in the Section 3.1 "Non-transparent errors" paragraph. But if I use the rate they quote in that paragraph, it scares the #(&)* out of me.
Ray
Ray,
Sorry to be so slow in responding; work gets in the way sometimes, doesn't it?
I think Google must have published that paper just in time. I searched and found no mention of RAISE or equivalent. http://www.kingston.com/us/ssd/raise
“This technology provides the protection and reliability of RAID 5 (Redundant Array Of Independent Disks) on a single SSD drive without twice the write overhead of parity and an Uncorrectable Bit Error Rate (UBER) of nearly one quadrillion times less than a standard SSD Flash Storage Processor without R.A.I.S.E. ™ or 1 bit error for every 100 octillion bits (10^-29) or ~111022302462515.66 Petabytes of data processed.”
But it makes sense that Google doesn't mention it; from the timelines (a 6-year SSD drill-down), RAISE came along later. Notice that EMC, in their all-flash VMAX, supports two RAID levels, 7+1 or 14+2. Do you suppose the six 9's only apply to 14+2 configs? Or that there is a risk of blowing out a RAID5? I don't think so. So why no urgency to abandon RAID5 *at the very least*?
Back to Pure and XtremIO. They use custom RAID levels; there is a paper that describes EMC's, and it is mind-numbing: http://bit.ly/1QNb5gB. Pure and XtremIO can use crappy SSDs because they can tolerate multiple failures. EMC's AFA VMAX, not so much, but it doesn't need to. Those EFDs are far more reliable than spinning disks or older SSDs that are behind the tech curve. Consumer SSD = cheap, and this is a good thing for margins.
Here’s where I think the triple parity RAID thing is needed (certainly not in a VMAX AFA or equivalent Enterprise AFA):
– Those that are stuck with a legacy RAID infra and want to use cheap SSD for their formerly 30+2 RAID array
– Those that are looking at 20, 30 TB SATA drive futures (new RAID tech or otherwise) and want to have the luxury of lazy rebuilds on first fail
– HPC with many thousands of disks, where triple parity is a must (GPFS, ZFS and others)
Rob,
Thanks again for your insightful commentary. I think we can agree here. The need for triple-parity RAID seems to be driven primarily by the larger SSDs now coming on the market. I wasn't aware of RAISE, but it looks very similar to IBM FlashSystem's vRAID (see: http://www-03.ibm.com/systems/storage/flash/v9000/). I still think RAID5 and RAID6/DP will continue to have a place. But as SSDs get much larger and rebuild times get longer, the risk starts to go up.
HPC is another world altogether.
Betteridge’s Law (the answer to any headline phrased as a question is no) applies here. Double failures due to long rebuild times are a real issue. I have seen data lost for that reason. However, when you’re talking about that level of redundancy “parity” is a misnomer. It’s no longer double, triple, or any level of parity. It’s erasure coding, which is a huge family of algorithms. By using the general term we not only improve accuracy for its own sake but we also retain the flexibility to change algorithms without having to change documentation or other communication about what it is we’re doing. “Triple parity” is just a marketing term, and a particularly unhelpful one at that.
Jeff, thanks for your comment. I know this is not parity in the pure sense, but it is a trusty simplification of the idea. The title was bad enough as it was; "Has triple-level erasure coding RAID time come?" would have been even worse. Yes, I agree it's erasure coding, and yes, in general I agree that being technically accurate is well worth it. Sorry about that.
Ray