Intel’s 320 SSD “8MB problem”

Intel SSD 320_001 by TAKA@P.P.R.S (cc) (from Flickr)

I read a recent news item reporting that Intel has been able to replicate the 320 SSD 8MB problem that some customers have been complaining about.

Apparently the problem occurs when power is suddenly removed from the device.  The end result is that the SSD reports a capacity of only 8MB instead of its original 40GB or more.

I have seen these sorts of problems before.  It probably has something to do with table updating activity associated with SSD wear leveling.

Wear leveling

NAND wear leveling looks very similar to virtual memory addressing and maps storage block addresses to physical NAND pages. Essentially something similar to a dynamic memory page table is maintained that shows where the current block is located in the physical NAND space, if present.  Typically, there are multiple tables involved, one for spare pages, another for mapping current block addresses to NAND page location and offset, one for used pages, etc.  All these tables have to be in some sort of non-volatile storage so they persist after power is removed.
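As a rough illustration of the tables described above, here is a toy mapping layer in Python. It is a sketch under heavy assumptions, not how any real drive works: the class name `TinyFTL` and all of its fields are hypothetical, erasure is modeled per page (real NAND erases in larger blocks), and the tables live in ordinary memory rather than non-volatile storage.

```python
class TinyFTL:
    """Toy mapping layer: logical block address -> physical NAND page (hypothetical)."""

    def __init__(self, num_pages):
        self.mapping = {}                          # logical block address -> physical page
        self.free_pages = list(range(num_pages))   # spare (erased) pages
        self.used_pages = set()                    # pages holding stale data, awaiting erase
        self.erase_counts = [0] * num_pages        # wear per physical page
        self.nand = [None] * num_pages             # simulated NAND contents

    def write(self, lba, data):
        # NAND is not overwritten in place, so every write lands on a fresh page.
        # Choosing the least-worn free page spreads erases across the device.
        page = min(self.free_pages, key=lambda p: self.erase_counts[p])
        self.free_pages.remove(page)
        self.nand[page] = data
        old = self.mapping.get(lba)
        self.mapping[lba] = page                   # the table update that must survive power loss
        if old is not None:
            self.used_pages.add(old)               # the previous copy is now stale

    def read(self, lba):
        page = self.mapping.get(lba)
        return None if page is None else self.nand[page]

    def reclaim(self):
        # Erase stale pages and return them to the spare pool.
        for page in sorted(self.used_pages):
            self.nand[page] = None
            self.erase_counts[page] += 1
            self.free_pages.append(page)
        self.used_pages.clear()


ftl = TinyFTL(num_pages=4)
ftl.write(0, "first")
ftl.write(0, "second")   # same logical block, different physical page
```

Even in this toy, a single logical write touches three tables (the mapping, the spare list, and the used list), which is why an interruption partway through can leave them disagreeing with one another.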

Updating such tables and maintaining their integrity is a difficult endeavor.  More than likely some sort of table update is not occurring in an ACID fashion.
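To make the idea of an atomic table update concrete, here is one common general-purpose technique sketched in Python: keep two copies of the table, each tagged with a sequence number, a length, and a checksum, and always overwrite the older copy. This is purely illustrative and is not a claim about Intel's firmware; `TwoSlotStore`, `encode_table`, `decode_table`, and the `power_fails_after` parameter are all hypothetical names, and power loss is modeled simply as a truncated write.

```python
import json
import zlib


def encode_table(table, seq):
    # Frame layout (an assumption for this sketch): 4-byte length, 4-byte CRC32, JSON payload.
    payload = json.dumps({"seq": seq, "table": table}, sort_keys=True).encode()
    return len(payload).to_bytes(4, "big") + zlib.crc32(payload).to_bytes(4, "big") + payload


def decode_table(blob):
    """Return (seq, table), or None if the blob is missing, torn, or corrupt."""
    if blob is None or len(blob) < 8:
        return None
    length = int.from_bytes(blob[:4], "big")
    crc = int.from_bytes(blob[4:8], "big")
    payload = blob[8:]
    if len(payload) != length or zlib.crc32(payload) != crc:
        return None
    record = json.loads(payload)
    return record["seq"], record["table"]


class TwoSlotStore:
    """Two non-volatile slots; a commit never touches the newest valid copy."""

    def __init__(self):
        self.slots = [None, None]

    def _target_slot(self):
        decoded = [decode_table(s) for s in self.slots]
        for i, d in enumerate(decoded):
            if d is None:
                return i                      # reuse an empty or damaged slot first
        return 0 if decoded[0][0] < decoded[1][0] else 1   # else the older copy

    def commit(self, table, seq, power_fails_after=None):
        blob = encode_table(table, seq)
        if power_fails_after is not None:
            blob = blob[:power_fails_after]   # simulate power loss mid-write
        self.slots[self._target_slot()] = blob

    def recover(self):
        valid = [d for d in (decode_table(s) for s in self.slots) if d is not None]
        return max(valid, key=lambda d: d[0]) if valid else None


store = TwoSlotStore()
store.commit({"0": 1}, seq=1)
store.commit({"0": 2}, seq=2)
store.commit({"0": 3}, seq=3, power_fails_after=10)   # torn write
recovered = store.recover()   # falls back to the last complete table (seq 2)
```

The design choice here is that a failed update costs at most the newest change, never the whole table. A drive that instead updated its only copy in place could come back from a power cut with a table it can no longer interpret.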

Intel’s fix

Intel has replicated the problem and promises a firmware fix. In my experience this is entirely possible.  Most probably customer data has not been lost (although this is not a certainty); it's just not accessible at the moment. Intel has also reminded everyone that, as with any storage device, periodic backups to other devices are essential, and SSDs are no exception.

I am certain that Intel and others are enhancing their verification and validation (V&V) activities to better probe and test the logic behind wear leveling fault tolerance, at least with respect to power loss. Of course, redesigning the table update algorithm to be more robust, reliable, and fault tolerant is the long-range solution to these sorts of problems, but it may take longer than just a bug fix.

The curse of complexity

But all this raises a critical question: as one puts more and more complexity outboard into the drive, are we inducing more risk?

It’s a perennial problem in the IT industry. Software bugs are highly correlated with complexity and are therefore ubiquitous, difficult to eliminate entirely, and apt to escape even the best efforts to eradicate them before customer shipment.  However, we can all get better at reducing bugs, i.e., we can make them less frequent, less impactful, and less visible.

What about disks?

All that being said, rotating media is not immune to the complexity problem. Disk drives have different sorts of complexity. For example, block addressing on disk is mostly static, and mapping updates occur only occasionally (for defect skipping) rather than constantly, as they do with NAND whenever data is written.  As such, problems with power loss impacting table updates are less frequent and less severe with disks.  On the other hand, stiction, vibration, and HDI are all very serious problems with rotating media, whereas SSDs have a natural immunity to these issues.

Any new technology brings both advantages and disadvantages with it.  NAND-based SSD advantages include high speed, low power, and increased ruggedness, but the disadvantages involve cost and complexity.  We can sometimes trade off cost against complexity, but we cannot eliminate complexity entirely.

Moreover, while we cannot eliminate the complexity of NAND wear leveling today, we can always test it better.  That’s probably the most significant message coming out of today’s issue.  Any SSD product testing has to take into account the device’s intrinsic complexity and exercise it well, under adverse conditions. Power failure is just one example; I can think of dozens more.
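As one small example of what exercising power failure could look like in a test, the Python sketch below cuts power at every possible byte offset of a table update and checks what recovery returns each time. It is an assumption-laden toy: the append-only log format, the function names `frame`, `recover`, and `power_cut_sweep`, and the model of power loss as a clean truncation are all hypothetical. A real NAND device can also leave garbage or partially programmed pages behind, which this does not model.

```python
import zlib


def frame(payload: bytes) -> bytes:
    # Record layout (an assumption): 4-byte length, 4-byte CRC32, then the payload.
    return len(payload).to_bytes(4, "big") + zlib.crc32(payload).to_bytes(4, "big") + payload


def recover(log: bytes):
    """Return the last intact payload in the log, or None if there is none."""
    last, pos = None, 0
    while pos + 8 <= len(log):
        length = int.from_bytes(log[pos:pos + 4], "big")
        crc = int.from_bytes(log[pos + 4:pos + 8], "big")
        payload = log[pos + 8:pos + 8 + length]
        if len(payload) != length or zlib.crc32(payload) != crc:
            break                      # torn or corrupt record: stop at the last good one
        last = payload
        pos += 8 + length
    return last


def power_cut_sweep(old: bytes, new: bytes):
    """Simulate losing power after every byte of the new record is written."""
    base = frame(old)                  # a previously committed table
    record = frame(new)                # the update in flight
    outcomes = []
    for cut in range(len(record) + 1):
        log = base + record[:cut]      # only the first `cut` bytes reached the media
        outcomes.append(recover(log))
    return outcomes


outcomes = power_cut_sweep(b"map-v1", b"map-v2")
# The property under test: recovery yields the old table or the new one, never anything else.
all_safe = all(o in (b"map-v1", b"map-v2") for o in outcomes)
```

The point of sweeping every offset rather than sampling a few is that table-update bugs tend to hide in narrow windows, and an exhaustive cut is cheap when the update is simulated.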