Flash field experience at Google
In a FAST’16 paper I recently read (Flash reliability in production: the expected and the unexpected, see p. 67), researchers at Google reported on field experience with flash drives in their data centers, totaling many millions of drive days across MLC, eMLC and SLC drives with a minimum of 4 years of production use (3 years for eMLC). In some cases they had 2 generations of the same drive in their field population. SSD reliability in the field was not what I would have expected, and it was a surprise to Google as well.
The SSDs seem to be used in a number of different application areas, but mainly as SSDs with a custom-designed PCIe interface (FusionIO drives, maybe?). Aside from the NAND technology differences, there were lithographic changes as well: from 50 to 34nm for SLC drives, from 50 to 43nm for MLC drives, and from 32 to 25nm for eMLC drives.
Flash field results – good news
They published P/E cycle statistics for SLC drives, which have a P/E limit of 100K cycles; eMLC drives, with a 10K cycle limit; and MLC drives, with a 3K cycle limit. Over their minimum 4-year field life (3 years for eMLC), the average P/E cycles consumed ranged from 185 to 860 for SLC drives, 544 to 949 for MLC drives, and 377 to 607 for eMLC drives. While the 949 figure is close to 1/3rd of the P/E limit for MLC drives, it’s an average across their entire MLC field population over a minimum of 4 years of use. So as far as Google’s results are concerned, P/E cycle exhaustion and NAND endurance are not a real concern.
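To put those figures in perspective, here’s a quick back-of-the-envelope calculation of my own (not from the paper), taking the worst-case average from each range above against the rated limits:

```python
# Fraction of rated P/E endurance consumed over the field life, using the
# worst-case average from each range reported above (my sketch, not the
# paper's own analysis).
PE_LIMITS = {"SLC": 100_000, "eMLC": 10_000, "MLC": 3_000}   # rated P/E cycles
WORST_AVG = {"SLC": 860, "eMLC": 607, "MLC": 949}            # observed averages

for tech, limit in PE_LIMITS.items():
    frac = WORST_AVG[tech] / limit
    print(f"{tech}: {frac:.1%} of rated endurance consumed")
```

Even the MLC worst case leaves roughly two-thirds of rated endurance unused after a minimum of 4 years in the field.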
Even though their P/E cycle averages came nowhere close to the P/E cycle limits, they did track RBER (read bit error rate) for each drive class across the range of P/E cycles in their base. The data showed that even beyond the P/E limit for MLC drives, RBER did not increase significantly, showing only a gradual rise.
I think I speak for most of my analyst brethren when I say our belief has mostly been that the problem with SSD reliability was NAND endurance. In fact, a while back I wrote a post on why using a new batch of SSDs in a RAID group might not be a good idea (see The antifragility of disk RAID groups, the fragility of SSDs, and what to do about it [love those titles]). The idea was that a fresh group of SSDs had a high likelihood of failing at the same time, when they all exceeded their P/E cycle limits. Google’s data shows that this is NOT the case – wrong again, Ray.
There was a lot of discussion of RBER in the paper. But in the end, RBERs were mostly recovered by ECC or occasionally by read retry, so they were considered transient errors and, as such, weren’t as interesting.
Flash field results – bad news
Google did find one disturbing issue in their flash field results: the number of (mostly read) uncorrectable errors (UEs) coming off their SSDs was higher than anticipated. In their data, 20 to 63% of all SSD drives experienced at least one uncorrectable read error during their life in the field, with 2-6 out of every 1000 drive days impacted by a UE. According to their data, uncorrectable read errors accounted for ~2 orders of magnitude (100X) more non-transparent (unrecoverable) IO errors than any other fault. For comparison, write uncorrectable errors occurred in only about 1.5-2.5% of drives, or about 1-4 out of every 10,000 drive days.
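To get a feel for what 2-6 UE days per 1000 drive days means for a single drive, here’s a rough conversion to a per-drive annual probability. The independence assumption is mine (a naive Poisson sketch, not the paper’s model):

```python
import math

# Rough conversion of the reported UE rate (2-6 per 1000 drive days) into a
# per-drive annual probability, naively assuming UE days are independent
# across drives and days (my assumption, not the paper's analysis).
for per_1000_days in (2, 6):
    lam = per_1000_days / 1000 * 365        # expected UE days per drive-year
    p = 1 - math.exp(-lam)                  # P(at least one UE day in a year)
    print(f"{per_1000_days}/1000 drive days -> {p:.0%} per drive-year")
```

These naive numbers (roughly 50-90% per drive-year) come out well above the 20-63% of drives affected over their whole field lifetime, which itself suggests UEs are concentrated on a subset of drives rather than spread uniformly across the fleet.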
UEs represent actual data loss at the drive level. Most storage subsystems protect against UE data loss by using replication, RAID or erasure-code mechanisms. It’s important to note that disk drives also encounter UEs, but from Google’s field experience, SSDs have a much higher UE rate than disk drives.
According to Google’s data, SSD UEs are somewhat correlated with P/E cycles – the more P/E cycles a drive has consumed, the higher its probability of encountering UEs – but that’s about the only correlation they could find. They also found that if a drive had a UE one month, there was a 30% chance it would see another the next month. They couldn’t find any correlation between UEs and RBER, write error rate, erase error rate, or just about anything else they looked at.
Another thing they noticed was that the number of bad blocks in field use seemed to follow a step function: if an SSD exceeded 2-4 bad blocks in the field, it would rapidly accumulate 200-400 bad blocks shortly thereafter. This could be due to a NAND chip failure, but that seemed more supposition than fact to me.
Overall, Google found that ~6-9% of their SSDs required a repair action over 4 years (the drive being swapped out or replaced). They expected SLC and eMLC drives to require fewer repairs, but their data showed that NAND technology didn’t really matter with respect to drive repairs. For comparison, disk drives are also replaced, and in Google’s field experience with disk drives, they are seeing a 2-9% annual replacement rate, which is much worse than the SSD rate.
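To compare the two on the same footing, the 4-year SSD repair figure can be annualized. The compound-rate conversion below is my own simplification, not the paper’s methodology:

```python
# Annualizing the ~6-9% over-4-years SSD repair rate for comparison with
# the 2-9% annual replacement rate Google sees for disks. Compound-rate
# conversion is my assumption, not the paper's method.
for four_year in (0.06, 0.09):
    annual = 1 - (1 - four_year) ** (1 / 4)
    print(f"{four_year:.0%} over 4 years ~= {annual:.1%} per year")
```

That works out to roughly 1.5-2.3% per year for SSDs, at or below the bottom of the disk replacement range.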
Other flash field experience
A major vendor mentioned in passing that their field experience showed SSD P/E cycle consumption was less than 2% on average across their entire base, and that their SSD replacement rate was 1/3rd of their disk replacement rate. Both of these results agree with Google’s data.
There was no mention of UE rate differences between SSDs and disk drives, but as most of these would have been dealt with via RAID protection, they wouldn’t have led to data loss or drive repair unless extreme.
What this means for the data center today
One suggestion is that you should only use SSDs in RAID groups. It doesn’t necessarily need to be dual parity/RAID 6, as rebuild times would be pretty quick for an SSD RAID group. Replication or mirroring would also be a viable alternative to deal with the propensity for UEs.
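As a rough sanity check on the single-parity suggestion, here’s a sketch (my own numbers and independence assumptions, not from the paper) of the chance that a second drive in the group hits a UE day during a fast SSD rebuild, using the worst-case 6-per-1000-drive-day rate from Google’s data:

```python
def second_ue_during_rebuild(surviving_drives, ue_per_1000_days, rebuild_days):
    """P(at least one UE day on a surviving drive during the rebuild window),
    naively assuming independent UE days (illustrative only)."""
    p_day = ue_per_1000_days / 1000
    # P(no surviving drive has a UE day over the whole rebuild window)
    p_clean = (1 - p_day) ** (surviving_drives * rebuild_days)
    return 1 - p_clean

# Hypothetical 8-drive group: 7 survivors, worst-case rate, ~2-hour rebuild.
print(f"{second_ue_during_rebuild(7, 6, 2 / 24):.2%}")
```

Under these assumptions the exposure during a short rebuild stays well under 1%, which is why single parity (plus good backups) may be tolerable for SSD groups; longer rebuild windows or bigger groups push the number up roughly proportionally.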
As for PCs and laptops with SSDs, make sure you have a viable backup policy, because there is a high likelihood (20-63%) that your drive will encounter an unrecoverable read error over 4 years of use. Some anecdotal evidence: I have used disk-only Macs for over 20 years now, with none ever having a disk failure. I used SSD-based laptops for 4 years and recently purchased a Fusion Drive (SSD-disk hybrid) desktop. Within 3 months of Fusion Drive use, I had an unrecoverable error that wiped out my Fusion storage, and I had to completely restore it from backup.
Photo credit(s): All tables and graphs from the original paper in FAST’16 proceedings