Top 10 blog posts for 2011

Merry Christmas! Buon Natale! Frohe Weihnachten! by Jakob Montrasio (cc) (from Flickr)

Happy Holidays.

I ranked my blog posts using a ratio of hits to post age and have identified the top 10 most popular posts for 2011 (so far); a quick sketch of the ranking calculation follows the list:

  1. Vsphere 5 storage enhancements – We discuss some of the more interesting storage-oriented vSphere 5 announcements, which included a new DAS storage appliance, a host-based (software) replication service, storage DRS, and other capabilities.
  2. Intel’s 320 SSD 8MB problem – We discuss a recent bug (since fixed) that left the Intel 320 SSD with only 8MB of usable capacity; we presumed the bug was in the wear leveling/block mapping logic of the drive controller.
  3. Analog neural simulation or digital neuromorphic computing vs AI – We talk about recent advances in providing both analog (MIT) and digital (IBM) versions of neural computation vs. the more traditional AI approaches to intelligent computing.
  4. Potential data loss using SSD RAID groups – We note the possibility of catastrophic data loss when equally worn SSDs are used in RAID groups.
  5. How has IBM research changed – We examine some of the changes at IBM Research over the past 50 years or so that have led to much more productive research results.
  6. HDS buys BlueArc – We consider the implications of the recent acquisition of BlueArc storage systems by their major OEM partner, Hitachi Data Systems.
  7. OCZ’s latest Z-Drive R4 series PCIe SSD – Not sure why this got so much traffic, but it’s OCZ’s latest PCIe SSD device with 500K IOPS performance.
  8. Will Hybrid drives conquer enterprise storage – We discuss the unlikely possibility that Hybrid drives (NAND/Flash cache and disk drive in the same device) will be used as backend storage for enterprise storage systems.
  9. SNIA CDMI plugfest for cloud storage and cloud data services – We were invited to sit in on a recent SNIA Cloud Data Management Interface (CDMI) plugfest and talk to some of the participants about where CDMI is heading and what it means for cloud storage and data services.
  10. Is FC dead?! – With the introduction of 40GbE FCoE just around the corner, 10GbE cards coming down in price, and Brocade’s poor YoY quarterly storage revenue results, we discuss the potential implications for FC infrastructure and its future in the data center.
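
For the curious, here is a minimal sketch of that ranking calculation. The hit counts and dates below are made up purely to illustrate the arithmetic; only the hits-divided-by-age idea comes from how I actually ranked the posts:

```python
from datetime import date

# Hypothetical data: (title, total hits, publish date) -- numbers invented
posts = [
    ("Vsphere 5 storage enhancements", 5200, date(2011, 7, 14)),
    ("Is FC dead?!", 3100, date(2011, 11, 21)),
    ("HDS buys BlueArc", 2400, date(2011, 9, 7)),
]

today = date(2011, 12, 25)

def hits_per_day(post):
    _, hits, published = post
    return hits / (today - published).days  # normalize hits by post age

# Highest ratio first, so newer posts aren't penalized for having had
# less time to accumulate hits
for post in sorted(posts, key=hits_per_day, reverse=True):
    print(f"{post[0]}: {hits_per_day(post):.1f} hits/day")
```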

~~~~

I would have to say #3, #5, and #9 were the most fun for me to do. Not sure why, but #10 probably generated the most Twitter traffic. Why the others were so popular is hard for me to understand.

Comments?

Intel’s 320 SSD “8MB problem”

Intel SSD 320_001 by TAKA@P.P.R.S (cc) (from Flickr)

Read a recent news item on Intel being able to replicate their 320 SSD 8MB problem that some customers have been complaining about.

Apparently the problem occurs when power is suddenly removed from the device. The end result is that the SSD’s capacity is restricted to 8MB, down from 40GB or more.

I have seen these sorts of problems before.  It probably has something to do with table updating activity associated with SSD wear leveling.

Wear leveling

NAND wear leveling looks very similar to virtual memory addressing: it maps storage block addresses to physical NAND pages. Essentially, something like a dynamic memory page table is maintained that shows where each current block is located in physical NAND space, if present. Typically, multiple tables are involved: one for spare pages, another mapping current block addresses to NAND page location and offset, one for used pages, etc. All these tables have to reside in some sort of non-volatile storage so they persist after power is removed.
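
To make that concrete, here is a minimal, hypothetical sketch of the kind of block-to-page mapping a flash translation layer maintains. Real controllers are far more elaborate (per-block erase counts, multi-level maps, and so on); the class and method names are mine, not Intel’s:

```python
# Hypothetical flash translation layer (FTL) sketch: maps logical block
# addresses (LBAs) to physical NAND pages and never overwrites a page in
# place -- every write consumes a spare page and retires the old one.

class SimpleFTL:
    def __init__(self, num_pages):
        self.spare = list(range(num_pages))  # free NAND pages
        self.map = {}                        # LBA -> current physical page
        self.used = []                       # stale pages awaiting erase

    def write(self, lba, data):
        new_page = self.spare.pop()          # always program a fresh page
        if lba in self.map:
            self.used.append(self.map[lba])  # old copy becomes garbage
        self.map[lba] = new_page             # table update -- must persist!
        # (real firmware would program `data` into NAND here)

    def read(self, lba):
        return self.map.get(lba)             # current page for this LBA, if any
```

Note that all three structures change on every write and must survive a power cut together; lose or tear any one of them and the drive no longer knows where the data lives.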

Updating such tables and maintaining their integrity is a difficult endeavor. More than likely some sort of table update is not occurring in an ACID (atomic, consistent, isolated, durable) fashion.
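
One common way to make such updates power-loss safe is a shadow (ping-pong) update: persist the new copy of the table first, then publish it with a single atomic operation. Here is a hedged sketch using files to stand in for NAND regions; actual firmware would flip a sequence-numbered pointer between two flash areas rather than rename a file, but the ordering discipline is the same:

```python
import json
import os

# Hypothetical shadow-update commit: flush the new table copy to stable
# storage *before* atomically switching over.  A power cut at any point
# leaves either the old table or the new one -- never a torn mixture.

def commit_table(table, path="ftl_map.json"):
    tmp = path + ".new"
    with open(tmp, "w") as f:
        json.dump(table, f)
        f.flush()
        os.fsync(f.fileno())  # new copy hits stable storage first...
    os.replace(tmp, path)     # ...then one atomic rename publishes it
```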

Intel’s fix

Intel has replicated the problem and promises a firmware fix. In my experience this is entirely possible. Most probably customer data has not been lost (although this is not a certainty); it’s just not accessible at the moment. And Intel has reminded everyone that, as with any storage device, they should be taking periodic backups to other devices; SSDs are no exception.

I am certain that Intel and others are enhancing their verification and validation (V&V) activities to better probe and test the logic behind wear leveling fault tolerance, at least with respect to power loss. Of course, redesigning the table update algorithm to be more robust, reliable, and fault tolerant is the long-range solution to these sorts of problems, but that may take longer than just a bug fix.

The curse of complexity

But all this raises a critical question: as one puts more and more complexity outboard into the drive, are we inducing more risk?

It’s a perennial problem in the IT industry. Software bugs are highly correlated with complexity and are thereby ubiquitous, difficult to eliminate entirely, and apt to escape any and all efforts to eradicate them before customer shipments. However, we can all get better at reducing bugs, i.e., we can make them less frequent, less impactful, and less visible.

What about disks?

All that being said, rotating media is not immune to the complexity problem. Disk drives have different sorts of complexity: block addressing is mostly static, and mapping updates occur much less frequently (only for defect skipping) rather than on every write as with NAND. As such, problems with power loss impacting table updates are less frequent and less severe with disks. On the other hand, stiction, vibration, and head-disk interference (HDI) are all very serious problems for rotating media to which SSDs have a natural immunity.
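
By contrast, a disk’s grown-defect map is nearly write-once. A hypothetical sketch shows how little state churns:

```python
# Hypothetical grown-defect remapping on a disk drive.  Unlike NAND wear
# leveling, this map changes only when a sector goes bad (a rare event),
# not on every write.

remap = {}                          # bad LBA -> spare sector
spares = [10_000_001, 10_000_002]   # reserved spare sectors

def resolve(lba):
    return remap.get(lba, lba)      # almost always the identity mapping

def grow_defect(bad_lba):
    remap[bad_lba] = spares.pop()   # reassign the bad sector to a spare
```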

~~~~
Any new technology brings both advantages and disadvantages with it. NAND-based SSD advantages include high speed, low power, and increased ruggedness, but the disadvantages involve cost and complexity. We can sometimes trade off cost against complexity, but we cannot eliminate complexity entirely.

Moreover, while we cannot eliminate the complexity of NAND wear leveling today, we can always test it better. That’s probably the most significant message coming out of today’s issue. Any SSD product testing has to take the device’s intrinsic complexity into account and exercise it well, under adverse conditions. Power failure is just one example; I can think of dozens more.
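
For example, a test harness can inject a simulated power cut at a random point in a write workload and then verify that recovery yields a consistent mapping. A minimal sketch follows, where `device` is a hypothetical test double wrapping the firmware under test (none of these method names come from any real tool):

```python
import random

# Hypothetical power-fail injection loop: interrupt a write workload at a
# random point, force recovery, and check that the mapping tables survived
# intact -- every block reads back as old or new contents, never garbage.

def power_fail_test(device, trials=1000, writes_per_trial=100):
    for _ in range(trials):
        device.reset()
        cut = random.randrange(writes_per_trial)  # fail at a random write
        for i in range(writes_per_trial):
            if i == cut:
                device.simulate_power_loss()      # yank the (virtual) plug
                break
            device.write(lba=i % 8, data=i)
        device.recover()
        assert device.is_consistent(), "torn table update detected"
```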

Comments?

Testing storage systems – Intel’s SSD fix

Intel’s latest (35nm NAND) SSD shipments were halted today because a problem was identified when users modify BIOS passwords (see the IT PRO story). At least they gave a timeframe for a fix – a couple of weeks.

The real question is whether products can be tested sufficiently these days to ensure they work in the field. Many companies today will ship product to end-user beta testers to work out the bugs before the product reaches the field. But beta testing has to be complemented with active product testing and validation. As such, unless you plan to enlist hundreds or perhaps thousands of beta testers, you could have a serious problem with field deployment.

And therein lies the problem: software products are relatively cheap and easy to beta test; just set up a download site and have at it. But with hardware products, beta testing actually involves sending product to end users, which costs quite a bit more to support. So I understand why Intel might be having problems with field deployment.

So if you can’t beta test hardware products as easily as software, then you have to have a much more effective test process. Functional testing and validation is more of an art than a science and can cost significant money and, more importantly, time. All of which brings us back to some form of beta testing.

Perhaps Intel could use their own employees as beta testers, rotating new hardware products from one organization to another over time to get some variability in the use of a new product. Many companies use their new product hardware extensively in their own data centers to validate functionality prior to shipment. In the case of Intel’s SSD drives, these could be placed in the innumerable servers and desktops that Intel no doubt has throughout its corporation.

One can argue whether beta testing takes longer than extensive functional testing. However, given today’s diverse deployments, I believe beta testing can be a more cost-effective process when done well.

Intel is probably trying to figure out just what went wrong in their overall testing process today. I am sure, given their history, they will do better next time.