The end of NAND is near, maybe…

In honor of today’s Flash Summit conference, I give my semi-annual amateur view of competing NAND technologies.

I was talking with a major storage vendor today who said they were sampling sub-20nm NAND chips rated at 300 P/E cycles, with a data retention period of under a week at room temperature. With those specifications, these chips can barely get out of the factory with any life left in them.

On the other hand, the only sub-20nm (19nm) NAND information I could find online was for the new Toshiba THNSNF SSDs, which use toggle MLC NAND and guarantee data retention of 3 months at 40°C. I could not find any published P/E cycle specifications for the NAND in that drive, but presumably it is at best equivalent to their prior-generation 24nm NAND, or at worst somewhere below that generation's P/E cycles. (Of course, I couldn't find P/E cycle specifications there either, but similar technology in other drives seems to offer native 3000 P/E cycles.)

Intel-Micron, SanDisk and others have all recently announced 20nm MLC NAND chips with P/E cycles of around 3K to 5K.

Naturally, as NAND chips go beyond their rated P/E cycle counts, NAND bit errors increase. With a more powerful ECC algorithm in SSDs and NAND controllers, one can still correct the data coming off the NAND chips. However, at some point beyond 24-bit ECC this probably becomes unsustainable. (See the interesting post by NexGen on ECC capabilities as NAND die sizes shrink.)
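
To put rough numbers on that, here's a back-of-the-envelope sketch (my own illustration, not from the NexGen post): if bit errors are independent and occur at some raw bit error rate (RBER), the chance that a codeword exceeds the ECC's correction limit rises very steeply once the expected error count approaches that limit. The 1KB codeword size and the RBER values below are illustrative assumptions.

```python
# Illustrative only: probability an n-bit codeword has more bit errors
# than a t-bit ECC can correct, assuming independent errors at rate rber.
from math import comb

def p_uncorrectable(n_bits: int, t: int, rber: float) -> float:
    p_correctable = sum(comb(n_bits, k) * rber**k * (1 - rber)**(n_bits - k)
                        for k in range(t + 1))
    return 1.0 - p_correctable

# Hypothetical 1KB (8192-bit) codeword with 24-bit ECC; the RBER values are
# made up to show how wear-driven RBER growth overwhelms a fixed ECC budget.
for rber in (1e-3, 2e-3, 3e-3):
    print(f"RBER {rber:.0e}: P(uncorrectable) = {p_uncorrectable(8192, 24, rber):.2e}")
```

In this toy model, a threefold increase in RBER takes the per-codeword failure rate from a few in a million to roughly a coin flip, which is why die shrinks demand ever stronger, and eventually impractical, ECC.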

I'm not sure how to bridge the gap between those 3-5K P/E cycles and the 300 P/E cycles being seen by the storage vendor above, but it may be a function of prototype vs. production technology, and possibly the chips had other characteristics they were interested in.

But given the declining endurance of NAND below 20nm, some industry players are investigating other solid state storage technologies to replace it, e.g., MRAM, FeRAM, PCM and ReRAM, all of which are current contenders, at least from a research perspective.

MRAM is currently available in small capacities from Everspin and elsewhere, but hasn't yet reached densities anywhere near today's NAND technologies.

ReRAM is starting to emerge in low power applications as a substitute for SRAM/DRAM, but it’s still early yet.

I haven't heard much about FeRAM, other than that last year researchers at Purdue invented a new non-destructive-read FeRAM they call FeTRAM. Standard FeRAMs are already in commercial use, albeit in limited applications, from Ramtron and others, but density is still a hurdle and write performance is a problem.

Recently the PCM approach has heated up, as PCM technology has now been commercially released by Micron. Yes, the technology has a long way to go to catch up with NAND densities (it's available at 45nm technology), but it's yet another start down a technology pathway: build volume, and research ways to reduce cost, increase density and generally improve the technology. In the meantime, I hear it's an order of magnitude faster than NAND.

Racetrack memory, a form of MRAM using wires to store multiple bits, isn't standing still either. Last December, IBM announced they had demonstrated Racetrack memory chips in their labs. With this milestone, IBM has shown how a complete Racetrack memory chip could be fabricated on a CMOS technology line.

However, in the same press release from IBM on recent research results, they announced a new technique to construct CMOS compatible graphene devices on a chip.  As we have previously reported, another approach to replacing standard NAND technology  uses graphene transistors to replace the storage layer of NAND flash.  Graphene NAND holds the promise of increasing density with much better endurance, retention and reliability than today’s NAND.

So as of today, NAND is still the king of solid state storage technologies but there are a number of princelings and other emerging pretenders, all vying for its throne of tomorrow.

Comments?

Image: 20 nanometer NAND Flash chip by IntelFreePress

Potential data loss using SSD RAID groups

Ultrastar SSD400 (c) 2011 Hitachi Global Storage Technologies (from their website)

The problem with SSDs is that they typically all fail after some quantity of data writes, called the write endurance specification.

As such, if you purchase multiple drives from the same vendor and put them in a RAID group, this can sometimes lead to multiple near-simultaneous failures.

Say the SSD write endurance is 250TB (you can only write 250TB to the SSD before write failure), and you populate a RAID 5 group with these drives in a 3-data-drive + 1-parity-drive configuration. As it's RAID 5, parity rotates around to each of the drives, roughly equalizing the parity write activity.

Now every write to the RAID group is actually two device writes, one for data and one for parity. Thus the 250TB of write endurance per SSD, which would suggest 1000TB of write endurance for the RAID group, is reduced to something more like 125TB*4, or 500TB. Specifically,

  • Each write to a RAID 5 data drive is replicated on the RAID 5 parity drive,
  • As parity rotates, each drive absorbs roughly half data writes and half parity writes, so any one drive can take at most 125TB of data writes and 125TB of parity writes before it uses up its write endurance spec.
  • So for the 4-drive RAID group, only 500TB of data, evenly spread across the group, can be written before the group's write endurance is exhausted.

As for RAID 6 it looks much the same, except that you lose more SSD life because you write parity twice. E.g., for a 6-data-drive + 2-parity-drive RAID 6 group with similar write endurance, each drive can absorb 83.3TB of data writes and 166.7TB of parity writes, which for the 8-drive group comes to roughly 666.7TB of data writes before the RAID group's write endurance lifetime is used up.

For RAID 1 with 2 SSDs in the group, every host write consumes endurance on both drives, as each drive mirrors the other. So you can only get 250TB of total data writes out of the RAID group.
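
Here's a minimal sketch of the endurance arithmetic above (my own illustration, using the simplifying assumption that each host write becomes one data write plus one parity write for RAID 5, two parity writes for RAID 6, or a mirror write for RAID 1, all spread evenly across the group):

```python
# Effective host-data endurance of a RAID group, assuming device writes
# (data + parity/mirror) are spread evenly across all drives in the group.

def group_endurance_tb(drives: int, device_writes_per_host_write: int,
                       per_drive_endurance_tb: float = 250.0) -> float:
    """TB of host data writable before the group's endurance is used up."""
    total_device_writes = drives * per_drive_endurance_tb
    return total_device_writes / device_writes_per_host_write

print(group_endurance_tb(4, 2))  # RAID 5, 3+1: data + parity   ->  500.0 TB
print(group_endurance_tb(8, 3))  # RAID 6, 6+2: data + 2 parity -> ~666.7 TB
print(group_endurance_tb(2, 2))  # RAID 1 pair: data + mirror   ->  250.0 TB
```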

But the “real” problem is much worse

If I am writing the last TB to my RAID group and have managed to spread the data writes evenly across the group, one drive will go out right away; most likely the drive holding the parity of the moment will throw a write error. BUT the real problem occurs during the rebuild.

  • With 256GB SSDs in the RAID 5 group and a 100MB/s read rate, reading the 3 surviving drives in parallel to rebuild the fourth will take ~43 minutes. However, that assumes the good SSDs do nothing except rebuild IO. Most systems limit rebuild IO to no more than 1/2 to 1/4 of drive activity (possibly much less) in the RAID group. As such, a more realistic rebuild time is anywhere from ~85 to ~170 minutes or more (a quick calculation sketch follows this list).
  • Because the rebuild takes that long, data must continue to be written to the RAID group. But as we are aware, most of the remaining drives in the group are likely to be at the end of their write endurance already.
  • Thus, it’s quite possible that another SSD in the RAID group will fail while the first drive is rebuilt.

Resulting in a catastrophic data loss (2 bad drives in a RAID 5, 3 drives in a RAID 6 group).
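
For the curious, the rebuild-time arithmetic in the first bullet above looks like this (again my own sketch, using the post's illustrative 256GB capacity and 100MB/s figures):

```python
# Rough rebuild time: surviving drives are read in parallel, so the rebuild
# is paced by streaming one drive's capacity at the throttled rate.

def rebuild_minutes(capacity_gb: float, drive_mb_s: float,
                    rebuild_share: float) -> float:
    effective_mb_s = drive_mb_s * rebuild_share
    return capacity_gb * 1000.0 / effective_mb_s / 60.0

print(f"{rebuild_minutes(256, 100, 1.00):.0f} min")  # ~43: rebuild-only IO
print(f"{rebuild_minutes(256, 100, 0.50):.0f} min")  # ~85: 1/2 of drive IO
print(f"{rebuild_minutes(256, 100, 0.25):.0f} min")  # ~171: 1/4 of drive IO
```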

RAID 1 groups with SSDs are probably even more prone to this issue: when the first drive fails, the second should follow closely behind.

Yes, but is this probable

First, we are talking TBs of data here. The likelihood that a RAID group's worth of drives would all have the same amount of data written to them, to within a matter of hours of rebuild time, is somewhat low. That being said, the lower the write endurance of the drives, the more equal the SSD wear at the creation of the RAID group, and the longer it takes to rebuild a failing SSD, the higher the probability of this type of catastrophic failure.

In any case, the problem is highly likely to occur with RAID 1 groups using similar SSDs, as the drives are always written in pairs.

But for RAID 5 or 6, it all depends on how well data striping across the RAID group equalizes data written to the drives.

For hard disks, equalizing IO activity across the drives in a RAID group was a good thing, and customers or storage systems all tried to do it. So with good (manual or automated) data striping, the problem becomes more likely.
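
To make that intuition concrete, here's a toy Monte Carlo simulation (entirely my own, with made-up parameters): give each drive a slightly different endurance, assume striping wears them down in endurance order, and ask how often the second-weakest drive's remaining headroom is smaller than the writes it must absorb while the weakest is rebuilt.

```python
# Toy model: probability a second SSD wears out during the rebuild window.
# All parameters are hypothetical; the point is that tighter endurance
# matching (small spread_tb) makes a double failure far more likely.
import random

def p_second_failure(drives=4, endurance_tb=250.0, spread_tb=1.0,
                     host_tb_per_hour=0.5, rebuild_hours=3.0,
                     trials=100_000) -> float:
    hits = 0
    for _ in range(trials):
        remaining = sorted(random.gauss(endurance_tb, spread_tb)
                           for _ in range(drives))
        headroom = remaining[1] - remaining[0]  # 2nd-weakest vs weakest
        # RAID 5: each host write is 2 device writes, now spread over the
        # surviving drives while the rebuild runs.
        per_survivor = host_tb_per_hour * rebuild_hours * 2 / (drives - 1)
        if headroom < per_survivor:
            hits += 1
    return hits / trials

print(p_second_failure(spread_tb=0.5))  # tightly matched drives: risky
print(p_second_failure(spread_tb=5.0))  # staggered wear: much safer
```

The absolute numbers mean nothing; what matters is how sharply the risk falls as drive wear is staggered.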

Automated storage tiering using SSDs is not as easy to fathom with respect to write endurance catastrophes.  Here a storage system automatically moves the hottest data (highest IO activity) to SSDs and the coldest data down to hard disks.  In this fashion, they eliminate any manual tuning activity but they also attempt to minimize any skew to the workload across the SSDs. Thus, automated storage tiering, if it works well, should tend to spread the IO workload across all the SSDs in the highest tier, resulting in similar multi-SSD drive failures.

However, with some vendors' automated storage tiering, the data is actually copied and not moved (that is, the data resides both on disk and on SSD). In this scenario, losing an SSD RAID group or two might severely constrain performance, but does not result in data loss. It's hard to tell which vendors do which, but customers should be able to find out.

So what’s an SSD user to do

  • Using RAID 4 for SSDs seems to make sense. The reason we went to RAID 5 and 6 was to avoid hot (parity-write) drives, but at SSD speeds, having a hot parity drive or two is probably not a problem. (There's some debate on this; we may lose some SSD performance by doing it.) Of course the RAID 4 parity drive will die much sooner, but paradoxically, having a skewed workload within the RAID group will increase SSD data availability.
  • Mixing SSD ages within RAID groups as much as possible. That way a single data-write level will not take out multiple drives at once.
  • Turning off LUN data striping within an SSD RAID group, so data IO can be more skewed.
  • Monitoring write endurance levels of your SSDs, so you can proactively replace them long before they fail.
  • Keeping good backups and/or replicas of SSD data.

I learned the other day that most enterprise SSDs provide some sort of write endurance meter that can be read at least at the drive level. I would suggest that all storage vendors make this sort of information widely available in their management interfaces. Sophisticated vendors could use such information to analyze the SSDs in a RAID group and suggest which SSDs to use together to maximize data availability.
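
As a sketch of what host-side monitoring might look like, the snippet below shells out to smartmontools' smartctl and looks for a wear attribute. SMART wear attribute names are vendor-specific (those listed are examples seen on Intel, Samsung and Crucial/Micron drives) and the device path is hypothetical, so adjust both for your environment.

```python
# Sketch: read a vendor-specific SSD wear attribute via `smartctl -A`.
# Requires smartmontools and (typically) root privileges.
import subprocess

WEAR_ATTRS = ("Media_Wearout_Indicator",   # e.g., Intel
              "Wear_Leveling_Count",       # e.g., Samsung
              "Percent_Lifetime_Remain")   # e.g., Crucial/Micron

def wear_line(device: str) -> str:
    out = subprocess.run(["smartctl", "-A", device],
                         capture_output=True, text=True).stdout
    for line in out.splitlines():
        if any(attr in line for attr in WEAR_ATTRS):
            return line.strip()
    return "no known wear attribute found; check your vendor's docs"

print(wear_line("/dev/sda"))  # hypothetical device path
```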

But in any event, for now at least, I would avoid RAID 1 using SSDs.

Comments?