I have been thinking about writing a post on “Is Flash Dead?” for a while now, at least since talking with IBM Research a couple of weeks ago about the new memory technologies they have been working on.
As we have discussed before, NAND flash memory has some serious limitations as it’s shrunk below 11nm or so. For instance, write endurance plummets, memory retention times are reduced and cell-to-cell interactions increase significantly.
These issues are not that much of a problem with today’s flash at 20nm or so. But to continue to follow Moore’s law and drop the price of NAND flash on a $/Gb basis, it will need to shrink below 16nm. At that point or soon thereafter, current NAND flash technology will no longer be viable.
Other non-NAND based non-volatile memories
That’s why IBM and others are working on different types of non-volatile storage such as PCM (phase change memory), MRAM (magnetic RAM), FeRAM (ferroelectric RAM) and others. All these have the potential to improve general reliability characteristics beyond where NAND Flash is today and where it will be tomorrow as chip geometries shrink even more.
IBM seems to be betting on MRAM or racetrack memory technology because it has near DRAM performance, extremely low power and can store far more data in the same amount of space. It sort of reminds me of delay line memory where bits were stored on a wire line and read out as they passed across a read/write circuit. Only in the case of racetrack memory, the delay line is etched in a silicon circuit indentation with the read/write head implemented at the bottom of the cleft.
Graphene as the solution
Then along comes Graphene-based Flash Memory. Graphene can apparently be used as a substitute for the storage layer in a flash memory cell. According to the report, the graphene stores data using less power and with better stability over time, both crucial problems with NAND flash memory as it’s shrunk below today’s geometries. The research is being done at UCLA and is supported by Samsung, a significant manufacturer of NAND flash memory today.
Current demonstration chips are much larger than would be useful. However, given graphene’s material characteristics, the researchers believe there should be no problem scaling it down below where NAND Flash would start exhibiting problems. The next iteration of research will be to see if their scaling assumptions can hold when device geometry is shrunk.
The other problem is getting graphene, a new material, into current chip production. Current materials used in chip manufacturing lines are very tightly controlled and building hybrid graphene devices to the same level of manufacturing tolerances and control will take some effort.
So don’t look for Graphene Flash Memory to show up anytime soon. But given that 16nm chip geometries are only a couple of years out and 11nm, a couple of years beyond that, it wouldn’t surprise me to see Graphene based Flash Memory introduced in about 4 years or so. Then again, I am no materials expert, so don’t hold me to this timeline.
The problem with SSDs is that they typically all fail at some level of data writes, called the write endurance specification.
As such, if you purchase multiple drives from the same vendor and put them in a RAID group, this shared write endurance limit can sometimes cause multiple drives to fail at about the same time.
Say the SSD write endurance is 250TB (you can only write 250TB to the SSD before write failure), and you populate a RAID 5 group with them in a 3-data-drive + 1-parity-drive configuration. As it’s RAID 5, parity rotates around to each of the drives, roughly equalizing the parity write activity.
Now every write to the RAID group is actually two writes, one for data and one for parity. Thus, the 250TB of write endurance per SSD, which ought to give 1000TB of write endurance for the RAID group, is reduced to something more like 125TB*4 or 500TB of data writes. Specifically,
Each write to a RAID 5 data drive also generates a parity write to the RAID 5 parity drive of that stripe,
As parity rotates around the group, each drive absorbs roughly 125TB of data writes and 125TB of parity writes before it uses up its write endurance spec.
So for the 4-drive RAID group, once 500TB of data, spread evenly across the group, has been written, no more data can be written.
As for RAID 6, it looks almost the same except that you lose more SSD life, as you write parity twice. E.g., for a 6-data-drive + 2-parity-drive RAID 6 group with similar write endurance, you get roughly 83.3TB of data writes and 166.7TB of parity writes per drive, which for the 8-drive group is about 666.7TB of data writes before the RAID group’s write endurance lifetime is used up.
For RAID 1 with 2 SSDs in the group, every write is mirrored to both drives, so the group can only absorb 250TB of total data writes before both drives reach their write endurance limit.
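A minimal sketch of the arithmetic above (assuming perfect striping, the 250TB per-drive endurance figure, and counting one extra physical write per parity or mirror copy; the drive counts are just the example configurations used here):

```python
# Back-of-the-envelope RAID group write endurance, assuming writes are spread
# perfectly evenly and every logical write costs one data write plus one
# physical write per parity (or mirror) copy in the layout.

def group_data_write_endurance(drives, per_drive_endurance_tb, extra_writes):
    total_budget_tb = drives * per_drive_endurance_tb   # whole group's write budget
    return total_budget_tb / (1 + extra_writes)         # share left for actual data

E = 250  # TB of write endurance per SSD (the figure assumed above)

print(group_data_write_endurance(4, E, extra_writes=1))  # RAID 5, 3+1   -> 500.0 TB
print(group_data_write_endurance(8, E, extra_writes=2))  # RAID 6, 6+2   -> ~666.7 TB
print(group_data_write_endurance(2, E, extra_writes=1))  # RAID 1 mirror -> 250.0 TB
```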
But the “real” problem is much worse
If I am writing the last TB to my RAID group and I have managed to spread the data writes evenly across it, one drive will fail right away, most likely with the current parity drive throwing a write error. BUT the real problem occurs during the rebuild.
With 256GB SSDs in the RAID 5 group and a 100MB/s read rate, reading the 3 surviving drives in parallel to rebuild the fourth will take ~43 minutes. However, that assumes the good SSDs are doing nothing but rebuild IO. Most systems limit rebuild IO to no more than 1/2 to 1/4 of the drive activity (possibly much less) in the RAID group. As such, a more realistic rebuild time can be anywhere from roughly 85 to 170 minutes or more.
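A quick sketch of that rebuild-time estimate (the 256GB capacity, 100MB/s read rate, and the 1/2 to 1/4 rebuild-IO shares are the assumptions used above):

```python
# Rough RAID 5 SSD rebuild-time estimate, per the assumptions in the text.

def rebuild_minutes(capacity_gb, read_mb_per_s, rebuild_io_share=1.0):
    # Surviving drives are read in parallel, so the rebuild is gated by reading
    # one drive's worth of capacity at the rebuild rate the system allows.
    seconds = (capacity_gb * 1000) / (read_mb_per_s * rebuild_io_share)
    return seconds / 60

print(rebuild_minutes(256, 100))         # drives dedicated to rebuild: ~43 minutes
print(rebuild_minutes(256, 100, 0.5))    # rebuild gets 1/2 of drive IO: ~85 minutes
print(rebuild_minutes(256, 100, 0.25))   # rebuild gets 1/4 of drive IO: ~171 minutes
```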
Now, because the rebuild takes a long time, data will continue to be written to the RAID group while it runs. But as we are aware, most of the remaining drives in the RAID group are likely to be at the end of their write endurance already.
Thus, it’s quite possible that another SSD in the RAID group will fail while the first drive is rebuilt.
Resulting in a catastrophic data loss (2 bad drives in a RAID 5, 3 drives in a RAID 6 group).
RAID 1 groups with SSDs are probably even more prone to this issue. When the first drive fails, the second should follow closely behind.
Yes, but is this probable
First, we are talking TBs of data here. It is somewhat unlikely that a RAID group’s worth of drives would all have had the same amount of data written to them to within a rebuild time’s worth of writes. That being said, the lower the write endurance of the drives, the more equal the SSD write endurance at the creation of the RAID group, and the longer it takes to rebuild failing SSDs, the higher the probability of this type of catastrophic failure.
In any case, the problem is highly likely to occur with RAID 1 groups using similar SSDs as the drives are always written in pairs.
But for RAID 5 or 6, it all depends on how well data striping across the RAID group equalizes data written to the drives.
For hard disks this was a good thing and customers or storage systems all tried to equalize IO activity across drives in a RAID group. So with good (manual or automated) data striping the problem is more likely.
Automated storage tiering using SSDs is not as easy to fathom with respect to write endurance catastrophes. Here a storage system automatically moves the hottest data (highest IO activity) to SSDs and the coldest data down to hard disks. In this fashion, they eliminate any manual tuning activity but they also attempt to minimize any skew to the workload across the SSDs. Thus, automated storage tiering, if it works well, should tend to spread the IO workload across all the SSDs in the highest tier, resulting in similar multi-SSD drive failures.
However, with some vendors’ automated storage tiering, the data is actually copied and not moved (that is, the data resides both on disk and SSD). In this scenario losing an SSD RAID group or two might severely constrain performance, but does not result in data loss. It’s hard to tell which vendors do which, but customers should be able to find out.
So what’s an SSD user to do
Using RAID 4 for SSDs seems to make sense. The reason we went to RAID 5 and 6 was to avoid hot (parity write) drive(s) but with SSD speeds, having a hot parity drive or two is probably not a problem. (Some debate on this, we may lose some SSD performance by doing this…). Of course the RAID 4 parity drive will die very soon, but paradoxically having a skewed workload within the RAID group will increase SSD data availability.
Mixing SSD ages within RAID groups as much as possible. That way a single data load level will not impact multiple drives.
Turning off LUN data striping within a SSD RAID group so data IO can be more skewed.
Monitoring write endurance levels for your SSDs, so you can proactively replace them long before they fail.
Keeping good backups and/or replicas of SSD data.
I learned the other day that most enterprise SSDs provide some sort of write endurance meter that can be seen at least at the drive level. I would suggest that all storage vendors make this sort of information widely available in their management interfaces. Sophisticated vendors could use such information to analyze the SSDs being used for a RAID group and suggest which SSDs to use to maximize data availability.
But in any event, for now at least, I would avoid RAID 1 using SSDs.
I talked last week with some folks from Nimbus Data who were discussing their new storage subsystem. Apparently it uses eMLC (enterprise Multi-Level Cell) NAND SSDs for its storage and has no SLC (Single Level Cell) NAND at all.
Nimbus believes with eMLC they can keep the price/GB down and still supply the reliability required for data center storage applications. I had never heard of eMLC before, but later that week I was scheduled to meet with Texas Memory Systems and Micron Technologies, who helped get me up to speed on this new technology.
eMLC and its cousin, eSLC, are high-durability NAND parts which supply more erase/program cycles than generally available from MLC and SLC respectively. If today’s NAND technology can supply 10K erase/program cycles for MLC and, similarly, 100K erase/program cycles for SLC, then eMLC can supply 30K. I have never heard a quote for eSLC, but 300K erase/program cycles before failure might be a good working assumption.
The problem is that NAND wears out, and can only sustain so many erase/program cycles before it fails. By having more durable parts, one can either take the same technology parts (from MLC to eMLC) to use them longer or move to cheaper parts (from SLC to eMLC) to use them in new applications.
This is what Nimbus Data has done with eMLC. Most data center class SSD or cache NAND storage these days is based on SLC. But SLC, with only one bit per cell, is very expensive storage. MLC has two (or three) bits per cell and can easily halve the cost of SLC NAND storage.
Moreover, the consumer market which currently drives NAND manufacturing depends on MLC technology for cameras, video recorders, USB sticks, etc. As such, MLC volumes are significantly higher than SLC and hence, the cost of manufacturing MLC parts is considerably cheaper.
But the historic problem with MLC NAND is the reduction in durability. eMLC addresses that problem by lengthening the page programming (tProg) cycle which creates a better, more lasting data write, but slows write performance.
The fact that NAND technology already has ~5X faster random write performance than rotating media (hard disk drives) makes this slightly slower write rate less of an issue. If eMLC took this down to only ~2.5X disk write speed, it would still be significantly faster. Also, there are a number of architectural techniques that can speed up drive writes which could easily be incorporated into any eMLC SSD.
How long will SLC be around?
The industry view is that SLC will go away eventually and be replaced with some form of MLC technology because the consumer market uses MLC and drives NAND manufacturing. The volumes for SLC technology will just be too low to entice manufacturers to support it, driving the price up and volumes even lower – creating a vicious cycle which kills off SLC technology. Not sure how much I believe this, but that’s conventional wisdom.
The problem with this prognosis is that by all accounts the next generation MLC will be even less durable than today’s generation (not sure I understand why but as feature geometry shrinks, they don’t hold charge as well). So if today’s generation (25nm) MLC supports 10K erase/program cycles, most assume the next generation (~18nm) will only support 3K erase/program cycles. If eMLC then can still support 30K or even 10K erase/program cycles that will be a significant differentiator.
Technology marches on. Something will replace hard disk drives over the next quarter century or so, and that something is bound to be based on transistorized logic of some kind, not the magnetized media used in disks today. Given today’s technology trends, it’s unlikely that this will continue to be NAND, but something else will most certainly crop up – stay tuned.
Intel-Micron Flash Technologies just announced another increase in NAND density. This one manages to put 8GB on a single chip with MLC (2 bits/cell) technology in a 167mm² package, or roughly half an inch per side.
You may recall that Intel-Micron Flash Technologies (IMFT) is a joint venture between Intel and Micron to develop NAND technology chips. IMFT chips can be used by any vendor and typically show up in Intel SSDs as well as other vendors’ systems. MLC technology is more suitable for use in consumer applications, but at these densities it’s starting to make sense for use by data centers as well. We have written before about MLC NAND used in enterprise drives by STEC and about Toshiba’s MLC SSDs. But in essence, MLC NAND reliability and endurance will ultimately determine its place in the enterprise.
But at these densities, you can just throw more capacity at the problem to mask MLC endurance concerns. For example, with this latest chip, one could conceivably build a single-layer 2.5″ configuration with almost 200GB of MLC NAND. If you wanted to configure this as a 128GB SSD, you could use the additional ~72GB of NAND to replace failing pages. Doing this could conceivably add more than 50% to the life of the SSD.
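A rough sketch of why that extra NAND buys lifetime (assuming ideal wear leveling, so total write endurance scales with raw capacity for a fixed user capacity; the 200GB/128GB split is the example above):

```python
# With ideal wear leveling, the total bytes a drive can absorb scales with its
# raw NAND capacity, so exposing only part of it as user capacity stretches the
# drive's life for the same workload.

def life_extension(raw_gb, user_gb):
    # Extra lifetime vs. a drive with only `user_gb` of raw NAND behind it.
    return raw_gb / user_gb - 1.0

print(f"{life_extension(200, 128):.0%}")   # -> 56%, i.e. "more than 50%" as above
```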
SLC still has better (~10X) endurance but being able to ship 2X the capacity in the same footprint can help. Of course, MLC and SLC NAND can be combined in a hybrid device to give some approximation of SLC reliability at MLC costs.
IMFT made no mention of SLC NAND chips at the 25nm technology node, but presumably this will be forthcoming shortly. As such, if we assume the technology can support a 4GB SLC NAND in a 167mm² chip, it should be of significant interest to most enterprise SSD vendors.
A couple of things were missing from yesterday’s IMFT press release, namely:
read/write performance specifications for the NAND chip
write endurance specifications for the NAND chip
SSD performance is normally a function of all the technology that surrounds the NAND chip, but it all starts with the chip. Also, MLC used to be capable of 10,000 write/erase cycles and SLC of 100,000 w/e cycles, but the most recent technology from Toshiba (presumably 34nm technology) shows an MLC NAND write/erase endurance of only 1,400 cycles. This seems to imply that as NAND density increases, write endurance degrades. How much is subject to much debate, and with the lack of any standardized w/e endurance specifications and reporting, it’s hard to see how bad it gets.
The bottom line: capacity is great, but we need to know w/e endurance to really see where this new technology fits. Ultimately, if endurance degrades significantly, such NAND technology will only be suitable for consumer products. Of course, with the consumer market at ~10X (just guessing) the size of the enterprise market, maybe that’s OK.
Today Toshiba announced a new series of SSD drives based on their 32nm MLC NAND technology. The new technology is interesting, but what caught my eye was another part of their website, i.e., their SSD FAQs. We have talked about MLC NAND technology before and have discussed its inherent reliability limitations, but this is the first time I have seen a company discuss its reliability estimates so publicly. This was documented more fully in an IDC white paper on their site, but the summary on the FAQ web page speaks to most of it.
Toshiba’s answer to the MLC write endurance question revolves around how much data a laptop user writes per day, which their study quantifies. Essentially, Toshiba assumes MLC NAND write endurance is 1,400 write/erase cycles, and for their 64GB drive a user would have to write, on average, 22GB/day for 5 years before they would exceed the manufacturer’s warranty based on write endurance cycles alone.
5 years is ~1825 days
22GB/day over 5 years would be over 40,000GB of data written
If we divide this by the 1,400 MLC W/E cycle limit given above, that gives us something like 28.7 NAND pages that could fail and yet still support write reliability.
Not sure what Toshiba’s MLC SSD supports for page size, but it’s not unusual for SSDs to ship an additional 20% of capacity to over-provision for write endurance and ECC. Given that 20% of 64GB is ~12.8GB, and it has to sustain at least ~28.7 NAND page failures, this puts Toshiba’s MLC NAND page at something like 512MB or ~4Gb, which makes sense.
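Reproducing that back-of-the-envelope arithmetic (all the figures are the ones quoted above; the 20% over-provisioning and the resulting ~512MB unit size are this post’s working assumptions, not a Toshiba specification):

```python
# Re-running the back-of-the-envelope numbers from the bullets above.

days = 5 * 365                      # ~1825 days of warranty coverage
total_written_gb = 22 * days        # 22GB/day -> ~40,150GB written over 5 years
units = total_written_gb / 1400     # at 1,400 W/E cycles -> ~28.7 "pages" worn out
overprovision_gb = 0.20 * 64        # 20% of a 64GB drive -> ~12.8GB of spare NAND
unit_size_mb = overprovision_gb / units * 1024

print(days, total_written_gb, round(units, 1), round(unit_size_mb), "MB")
# -> 1825 40150 28.7 457 MB, i.e. roughly 512MB (~4Gb) per replaceable unit
```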
The not so surprising thing about this analysis is that as drive capacity goes up, write endurance concerns diminish, because the amount of data that would need to be written daily to exceed the write endurance goes up linearly with the capacity of the SSD. Toshiba’s latest drive announcements offer 64/128/256GB MLC SSDs for the mobile market.
Toshiba studies mobile users write activity
To come at their SSD reliability estimate from another direction, Toshiba’s laptop usage modeling study of over 237 mobile users showed the “typical” laptop user wrote an average of 2.4GB/day (with auto-save & hibernate on) and a “heavy” laptop user wrote 9.2GB/day under similar conditions. Now averages are well and good, but to really put this into perspective one needs to know the workload variability. Nonetheless, their published results do put a rational upper bound on how much data typical laptop users write during a year, which can then be used to compute (MLC) SSD drive reliability.
I must applaud Toshiba for publishing some of their mobile user study information to help us all better understand SSD reliability for this environment. It would have been better to see the complete study, including all the statistics, when it was done, and how users were selected, and it would have been really nice to see this study done by a standards body (say SNIA) rather than a manufacturer, but these are all personal nits.
Now, I can’t wait to see a study on write activity for the “heavy” enterprise data center environment, …
I haven’t seen much of a specification on STEC’s new enterprise MLC SSD but it should be interesting. So far everything I have seen seems to indicate that it’s a pure MLC drive with no SLC NAND. This is difficult for me to believe but could easily be cleared up by STEC or their specifications. Most likely it’s a hybrid SLC-MLC drive similar, at least from the NAND technology perspective, to FusionIO’s SSD drive.
MLC write endurance issue
My difficulty with a pure MLC enterprise drive is the write endurance factor. MLC NAND can only endure around 10,000 erase/program passes before it starts losing data. With a hybrid SLC-MLC design one could have the heavy write data go to SLC NAND which has a 100,000 erase/program pass lifecycle and have the less heavy write data go to MLC. Sort of like a storage subsystem “fast write” which writes to cache first and then destages to disk but in this case the destage may never happen if the data is written often enough.
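A toy sketch of that hybrid routing idea (the threshold and pool names are purely illustrative, not how any particular SLC-MLC drive actually decides):

```python
# Toy illustration of the hybrid SLC-MLC idea: data that is rewritten often is
# kept in the small, durable SLC pool, while less actively written data lands
# in (or destages to) MLC. The threshold below is made up for illustration.

HOT_REWRITE_THRESHOLD = 10   # rewrites before we treat a block as "heavy write"

def choose_pool(rewrites_seen):
    """Pick which NAND pool a block's next write should go to."""
    return "SLC" if rewrites_seen >= HOT_REWRITE_THRESHOLD else "MLC"

print(choose_pool(2))    # cold data -> MLC
print(choose_pool(50))   # hot, frequently rewritten data -> SLC
```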
The only flaw in this argument is that as the SSD drives get bigger (STEC’s drive is available supporting up to 800GB) this becomes less of an issue. Because with more raw storage the fact that a small portion of the data is very actively written gets swamped by the fact that there is plenty of storage to hold this data. As such, when one NAND cell gets close to its lifetime another, younger cell can be used. This process is called wear leveling. STEC’s current SLC Zeus drive already has sophisticated wear leveling to deal with this sort of problem with SLC SSDs and doing this for MLCs just means having larger tables to work with.
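A toy illustration of the wear-leveling idea (the block count and the 10,000-cycle MLC limit are just the numbers discussed here; real controllers also juggle garbage collection, mapping tables, and static vs. dynamic wear):

```python
# Toy wear leveling: always direct the next erase/program cycle at the
# least-worn block, so no single block hits the MLC cycle limit early.

MLC_CYCLE_LIMIT = 10_000

def pick_block(erase_counts):
    block = min(erase_counts, key=erase_counts.get)   # least-worn block so far
    if erase_counts[block] >= MLC_CYCLE_LIMIT:
        raise RuntimeError("all blocks worn out")
    erase_counts[block] += 1
    return block

blocks = {b: 0 for b in range(8)}    # 8 blocks, all fresh
for _ in range(100):
    pick_block(blocks)
print(blocks)                        # wear spread evenly: 12-13 cycles per block
```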
I guess at some point, with multi-TB drives, the fact that MLC cannot sustain more than 10,000 erase/write passes becomes moot, because there just isn’t that much actively written data out there in an enterprise shop. When you amortize the highly written data as a percentage of a drive, the more drive capacity, the smaller the active data percentage becomes. As such, as SSD drive capacities get larger this becomes less of an issue. I figure with 800GB drives the active data proportion might still be high enough to cause a problem, but then again it might not be an issue at all.
Of course with MLC it’s also cheaper to over provision NAND storage to also help with write endurance. For an 800GB MLC SSD, you could easily add another 160GB (20% over provisioning) fairly cheaply. As such, over provisioning will also allow you to sustain an overall drive write endurance that is much higher than the individual NAND write endurance.
Another solution to the write endurance problem is to increase the power of ECC to handle write failures. This would probably take some additional engineering and may or may not be in the latest STEC MLC drive but it would make sense.
The other issue with MLC NAND is that it has slower read and erase/program cycle times. Now these are still orders of magnitude faster than standard disk, but slower than SLC NAND. For enterprise applications SLC SSDs are blistering fast and are often performance limited by the subsystem they are attached to. So the fact that MLC SSDs are somewhat slower than SLC SSDs may not even be perceived by enterprise shops.
MLC performance is slower because it takes longer to read a cell holding multiple bits than one holding just one. MLC, in one technology I am aware of, encodes 2 bits in the voltage that is programmed into or read out from a cell, e.g., voltage A = “00”, voltage B = “01”, voltage C = “10”, and voltage D = “11”. This gets more complex with 3 or more bits per cell, but the logic holds. With multiple voltages, determining which voltage level is present is more complex for MLC and hence takes longer to perform.
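A simple sketch of that 2-bit-per-cell readout (the voltage thresholds and the bit assignments below are purely illustrative, not any particular vendor’s encoding):

```python
# Illustrative MLC read: the sensed cell voltage must be compared against
# several thresholds to decide which of four 2-bit values was stored, which is
# why an MLC read takes longer than SLC's single yes/no comparison.

def read_mlc_cell(voltage, thresholds=(1.0, 2.0, 3.0), symbols=("00", "01", "10", "11")):
    for level, threshold in enumerate(thresholds):
        if voltage < threshold:
            return symbols[level]
    return symbols[-1]

print(read_mlc_cell(0.4))   # lowest voltage level -> "00"
print(read_mlc_cell(2.6))   # third voltage level  -> "10"
```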
In the end I would expect STEC’s latest drive to be some sort of SLC-MLC hybrid, but I could be wrong. It’s certainly possible that STEC has gone with just an MLC drive and beefed up the capacity, over-provisioning, ECC, and wear leveling algorithms to handle its lower write endurance.
MLC takes over the world
But the major issue with using MLC in SSDs is that MLC technology is driving the NAND market. All those items in the photo above are most probably using MLC NAND, if not today then certainly tomorrow. As such, the consumer market will be driving MLC NAND manufacturing volumes way above anything the SLC market requires. Such volumes will ultimately make it unaffordable to manufacture/use any other type of NAND, namely SLC in most applications, including SSDs.
So sooner or later all SSDs will be using only MLC NAND technology. I guess the sooner we all learn to live with that the better for all of us.