Repositioning of tape

HP LTO 4 Tape Media
In my past life, I worked for a dominant tape vendor. Over the years, we heard a number of times that tape was dead. But it never happened. BTW, it's not happening today either.

Just a couple of weeks ago at SNW, a vendor friend of mine asked if I knew anyone with tape library expertise because they were bidding on more and more tape archive opportunities. Tape seems alive and kicking from what I can see.

However, the fact is that tape use is being repositioned. Tape is no longer the direct target for backups that it once was. Most backup packages nowadays back up to disk and then later, if at all, migrate this data to tape (D2D2T). Tape is being relegated to a third tier of storage, a long-term archive and/or a long-term backup repository.

The economics of tape are not hard to understand. You pay for robotics, media and drives. Tape, just like any removable media, requires no additional power once it's removed from the transport/drive used to write it. Removable media can be transported to an offsite repository or across the continent. There it can await recall with nary an ounce (volt) of power consumed.

Problems with tape

So what's wrong with tape? Why aren't more shops using it? Let me count the problems:

  1. Tape, without robotics, requires manual intervention
  2. Tape, because of its transportability, can be lost or stolen, leading to data security breaches
  3. Tape processing, in general, is more error prone than disk. Tape can have media and drive errors which cause data transfer operations to fail
  4. Tape is accessed sequentially, it cannot be randomly accessed (quickly) and only one stream of data can be accepted per drive
  5. Much of a tape volume is wasted, never written space
  6. Tape technology doesn’t stay around forever, eventually causing data obsolescence
  7. Tape media doesn’t last forever, causing media loss and potentially data loss

I've likely missed some other issues with tape here, but these seem like the major ones from my perspective.

It’s no surprise that most of these problems are addressed or mitigated in one form or another by the major tape vendors, software suppliers and others interested in continuing tape technology.

Robotics can answer the manual intervention, if you can afford it. Tape encryption deals effectively with stolen tapes, but requires key management somewhere. Many applications exist today to help predict when media will go bad or transports need servicing. Tape data is, and always will be, accessed sequentially, but then so is lots of other data in today's IT shops. Tape transports are most definitely single threaded, but sophisticated applications can intersperse multiple streams of data onto that single tape. Tape volume stacking is old technology, not necessarily easy to deploy outside of some sort of VTL front-end, but it is available. Drive and media technology obsolescence will never go away, but this indicates a healthy tape marketplace.

Future of tape

Say what you will about Ultrium or the Linear Tape-Open (LTO) consortium, made up of HP, IBM, and Quantum research partners, but it has solidified/consolidated mid-range tape technology. Is it as advanced as it could be, or pushing to open new markets? Probably not. But they are advancing tape technology, providing higher capacity, higher performance and more functionality with recent generations. And they have not stopped: Ultrium's roadmap shows LTO-6 right after LTO-5, and delivery of LTO-5, a 1.6TB uncompressed capacity tape, is right around the corner.

Also IBM and Sun continue to advance their own proprietary tape technology. Yes, some groups have moved away from their own tape formats but that’s alright and reflects the repositioning that’s happening in the tape marketplace.

As for the future, I was at an IEEE magnetics meeting a couple of years back and the leader said that tape technology was always a decade behind disk technology. So the disk recording heads/media in use today will likely see some application to tape technology in about 10 years. As such, as long as disk technology advances, tape will come out with similar capabilities sometime later.

Still, it’s somewhat surprising that tape is able to provide so much volumetric density with decade old disk technology, but that’s the way tape works. Packing a ribbon of media around a hub, can provide a lot more volumetric storage density than a platter of media using similar recording technology.

In the end, tape has a future to exploit if vendors continue to push its technology. As a long term archive storage, it’s hard to beat its economics. As a backup target it may be less viable. Nonetheless, it still has a significant install base which turns over very slowly, given the sunk costs in media, drives and robotics.

Full disclosure: I have no active contracts with LTO or any of the other tape groups mentioned in this post.

Today's data and the 1000 year archive

Untitled (picture of a keypunch machine) by Marcin Wichary (cc) (from flickr)

Somewhere in my basement I have card boxes dating back to the 1970s and paper tape canisters dating back to the 1960s with BASIC, 360 assembly, COBOL, and PL/1 programs on them. These could be reconstructed, if needed, by reading the Hollerith encoding and typing them out into text files. Finding a compiler/assembler/interpreter to interpret and execute them is another matter, but just knowing the logic may suffice to translate them into another readily compilable language of today. Hollerith is a data card format which is well known and well described. But what of the data being created today? How will we be able to read such data in 50 years, let alone 500? That is the problem.

Vista de la Biblioteca Vasconcelos by Eneas (cc) (from flickr)

Civilization needs to come up with some way to keep information around for 1000 years or more. There are books relevant today (besides the Bible, Koran, and other sacred texts) that would have altered the world as we know it had they been unreadable 900 years ago. No doubt, data or information like this being created today will survive to posterity, by virtue of its recognized importance to the world. But there are a few problems with this viewpoint:

  • Not all documents/books/information are recognized as important during their lifetime of readability
  • Some important information is actively suppressed and may never be published during a regime’s lifetime
  • Even seemingly “unimportant information” may have significance to future generations

From my perspective, knowing what’s important to the future needs to be left to future generations to decide.

Formats are the problem

Consider my blog posts: WordPress creates MySQL database entries for blog posts. Imagine deciphering MySQL database entries 500 or 1000 years in the future and the problem becomes obvious. Of course, WordPress is open source, so this information could conceivably be readily interpretable by reading its source code.

I have written before about the forms that such long lived files can take but for now consider that some form of digital representation of a file (magnetic, optical, paper, etc.) can be constructed that lasts a millennia. Some data forms are easier to read than others (e.g., paper) but even paper can be encoded with bar codes that would be difficult to decipher without a key to their format.

The real problem becomes file or artifact formats. Who or what in 1000 years will be able to render a JPEG file, display an old MS Word file from 1995, or read a WordPerfect file from 1985? Okay, JPEG is probably a bad example as it's a standard format, but older Word and WordPerfect file formats constitute a lot of information today. Although there may be programs available to read them today, the likelihood that they will continue to do so in 50, let alone 500, years is pretty slim.

The problem is that as applications evolve from one version to another, formats change, and developers have a negative incentive to publicize these new file formats. Few developers today want to supply competitors with easy access to convert files to a competitive format. Hence, as developers or applications go out of business, formats cease to be readable or convertible into anything that could be deciphered 50 years hence.

Solutions to disappearing formats

What's missing, in my view, is a file format repository. Such a repository could be maintained by an adjunct of national patent trade offices (nPTOs). Just like today's patents, file formats, once published, could be available for all to see, in multiple databases or printouts. Corporations or other entities that create applications with new file formats would be required to register their new file format with the local nPTO. Such a format description would be kept confidential as long as that application or its descendants continued to support that format, or for copyright time frames, whichever came first.

The form that a file format could take could be the subject of standards activities but in the mean time, anything that explains the various fields, records, and logical organization of a format, in a text file, would be a step in the right direction.

This brings up another viable solution to this problem: self-defining file formats. Applications that use native XML as their file format essentially create a self-defining file format. Such a file format could potentially be understood by any XML parser. And XML, as a defined standard, is widely enough documented that it could conceivably be available to archivists of the year 3000. So I applaud Microsoft for using XML for their latest generation of Office file formats. Others, please take up the cause.
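To see why a self-defining format helps, here is a tiny sketch (the tag names are my own invention, not any real schema): because the field names travel with the data, any future XML parser can recover the structure without a separate format specification.

```python
# A minimal self-defining document: field names travel with the data,
# so no external format spec is needed to recover the structure.
import xml.etree.ElementTree as ET

doc = """<post>
  <title>The 1000 year archive</title>
  <published>2009</published>
  <body>Formats are the problem...</body>
</post>"""

root = ET.fromstring(doc)
for child in root:
    print(child.tag, "->", child.text)
```

A binary or proprietary format would need its field layout documented somewhere (e.g., in an nPTO repository) to be equally recoverable.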

If such repositories existed today, people in the year 3010 could still be reading my blog entries and wonder why I wrote them…

Are RAID's days numbered?

HP/EVA drive shelves in the HP/EVA lab in Colo. Springs
An older article that I recently came across, by Robin Harris of StorageMojo, said RAID 5 would be dead in 2009. In essence, it said that as drives get to 1TB or more, the time it takes to rebuild a drive requires going to RAID 6.

Another older article I came across said RAID is dead, all hail the storage robot. It seemed to say that when it comes to drive sizes there needs to be more flexibility and support for different capacity drives in a RAID group. Data Robotics' Drobo products now support this capability, which we discuss below.

I am here to tell you that RAID is not dead, not even on life support, and without it the storage industry would seize up and die. One must first realize that RAID as a technology is just a way to group together a bunch of disks and to protect the data on those disks. RAID comes in a number of flavors, which include definitions for:

  • RAID 0 – no protection
  • RAID 1 – mirrored data protection
  • RAID 2 through 5 – single parity protection
  • RAID 6 and DP – dual parity protection

The rebuild time problem with RAID

The problem with drive rebuild time is that the time it takes to rebuild a 1TB or larger disk drive can be measured in hours if not days, depending on the busy-ness of the storage system and the RAID group. And of course as 1.5 and 2TB drives come online this just keeps getting longer. This can be sped up by having larger single parity RAID groups (more disk spindles in the RAID stripe), by using DP which actually has two raid groups cross-coupled (which means more disk spindles), or by using RAID 6 which often has more spindles in the RAID group.
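A back-of-envelope calculation shows why rebuild times stretch into hours or days (the throughput numbers below are illustrative assumptions, not figures from any vendor): the bottleneck is usually the sustained rate at which the replacement drive can be written, and a busy system throttles that rate way down.

```python
# Rough rebuild-time estimate: hours to rewrite a failed drive at a
# given sustained rate (numbers are hypothetical, for illustration only).

def rebuild_hours(capacity_tb, effective_mb_per_sec):
    """Hours to stream capacity_tb onto the spare at the given rate."""
    capacity_mb = capacity_tb * 1_000_000
    return capacity_mb / effective_mb_per_sec / 3600

# An idle system streaming a 1TB drive at 100MB/s finishes in a few hours;
# a busy system throttled to 20MB/s on a 2TB drive takes over a day.
print(f"{rebuild_hours(1, 100):.1f} hours")
print(f"{rebuild_hours(2, 20):.1f} hours")
```

Spreading the rebuild reads over more spindles (bigger groups, or declustering as discussed below) attacks the read side of this bottleneck.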

Regardless of how you cut it there is some upper limit to the number of spindles that can be used to rebuild a failed drive – the number of active spindles in the storage subsystem. You could conceivably incorporate all these drives into a simple RAID 5 or 6 group (albeit, a very large one).

The downside of this large a RAID group is that data overwrite could potentially cause a performance bottleneck on the parity disks. That is, whenever a block is overwritten in a RAID 2-6 group, the parity for that data block (usually located on one or more other drives) has to be read, recalculated and rewritten back to the same location. Now it can be buffered, and lazily written but the data is not actually protected until parity is on disk someplace.
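The parity update described above can be sketched in a few lines. This is the classic RAID 5 small-write penalty: overwriting one data block costs two reads (old data, old parity) and two writes (new data, new parity), since the new parity is just the old parity XORed with the old and new data.

```python
# RAID 5 small-write (read-modify-write) sketch:
#   new_parity = old_parity XOR old_data XOR new_data
from functools import reduce

def update_block(data_blocks, parity, index, new_data):
    old_data = data_blocks[index]               # read old data
    new_parity = parity ^ old_data ^ new_data   # read old parity, recompute
    data_blocks[index] = new_data               # write new data
    return new_parity                           # write new parity

blocks = [0b1010, 0b0110, 0b1111]
parity = reduce(lambda a, b: a ^ b, blocks)     # parity = XOR of all data

parity = update_block(blocks, parity, 1, 0b0001)

# The shortcut matches a full parity recompute over the whole stripe.
assert parity == reduce(lambda a, b: a ^ b, blocks)
```

With one parity drive per small group, every overwrite funnels through that drive, which is why it can become the hot spot mentioned above.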

One way around this problem is to use a log structured file system. Log file systems never rewrite data, so there is no over-write penalty, nicely eliminating the problem.

Alas, not everyone uses log structured file systems for backend storage. So for the rest of the storage industry the write penalty is real and needs to be managed effectively in order to not become a performance problem. One way to manage this is to limit RAID group size to a small number of drives.

So the dilemma is that in order to provide reasonable drive rebuild times you want a wide (large) RAID group with as many drives as possible in it. But in order to minimize the (over-)write penalty you want as thin (small) a RAID group as possible. How can we solve this dilemma?

Parity declustering

Parity Declustering figure from Holland&Gibson 1992 paper

Look at the declustered parity scheme described by Holland and Gibson in their 1992 paper: parity and stripe data can be spread across more drives than just a RAID 5 or 6 group. They show an 8 drive system (see figure) where stripe data (three data block sets) and parity data (one parity block set) are rotated around the 8 physical drives in the array. In this way all 7 remaining drives are used to service a failed 8th drive. Some blocks will be rebuilt with one set of 3 drives and other blocks with a different set of 3 drives. Rebuilding the failed drive's full block set would involve all 7 remaining drives, but not all of them would be busy for every block. This should shrink drive rebuild time considerably by utilizing more spindles.

Because parity declustering distributes the parity across a number of disk drives as well as the data, no one disk holds the parity for all drives. This eliminates the hot drive phenomenon, normally dealt with by using smaller RAID group sizes.
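A toy layout makes the rebuild math concrete. The sketch below simply uses every 4-drive subset of 8 drives as a stripe placement; the real Holland and Gibson layout uses a carefully balanced block design, but the effect is the same: each individual stripe is rebuilt by reading only 3 drives, yet all 7 survivors share the total rebuild work.

```python
# Toy declustered layout: stripes of width 4 (3 data + 1 parity)
# spread over 8 physical drives by enumerating 4-drive subsets.
# Illustration only -- not the actual Holland & Gibson block design.
from itertools import combinations

n_drives, stripe_width = 8, 4
layout = list(combinations(range(n_drives), stripe_width))

failed = 7
touched = [s for s in layout if failed in s]            # stripes to rebuild
helpers = {d for s in touched for d in s} - {failed}    # drives doing reads

print(len(touched), "stripes touched; helpers:", sorted(helpers))
```

Each surviving drive appears in only a fraction of the affected stripes, so no single spindle is saturated the way a dedicated parity or mirror partner would be.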

The mixed drive capacity problem with RAID today

The other problem with RAID today is that it assumes a homogeneous set of disk drives in the storage array, so that the same blocks/tracks/block sets can be set up as a RAID stripe across those disks and used to compute parity. Now, according to the original RAID paper by Patterson, Gibson, and Katz, they never explicitly stated a requirement for all disk drives to be the same capacity, but it seems easiest to implement RAID that way. With drives of diverse capacity and performance you would normally want them in separate RAID groups. You could create a RAID group sized to the smallest capacity drive, but by doing this you waste all the excess storage in the larger disks.

Now one solution to the above would be the declustered parity scheme mentioned above, but in the end you would need at least N drives of the same capacity for whatever your stripe size (N) was going to be. And if you had that many same-capacity drives, why not just use RAID 5 or 6?

Another solution popularized by Drobo is to carve up the various disk drives into RAID group segments. So if you had 4 drives with 100GB, 200GB, 400GB and 800GB, you could carve out 4 RAID groups: a 100GB RAID5 group across 4 drives; another 100GB RAID 5 group across 3 drives; a RAID 1 mirror for 200GB across the largest 2 drives; and a RAID 0 of 400GB on the largest drive. This could be configured as 4 LUNs or windows drive letters and used any way you wish.
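The carving arithmetic works out exactly; a short sketch of the idea (my own illustration, not Drobo's actual algorithm) shows that slicing RAID groups off the drives from smallest up consumes every byte of all four drives.

```python
# Mixed-capacity carving from the example above (a sketch of the idea,
# not Drobo's actual algorithm). Slice RAID groups off each drive's
# free space and check nothing is oversubscribed or wasted.
drives = {"A": 100, "B": 200, "C": 400, "D": 800}  # GB free per drive

groups = [
    ("RAID5 x4", ["A", "B", "C", "D"], 100),  # 100GB slice off all 4 drives
    ("RAID5 x3", ["B", "C", "D"], 100),       # 100GB slice off the 3 larger
    ("RAID1 x2", ["C", "D"], 200),            # mirror across the 2 largest
    ("RAID0 x1", ["D"], 400),                 # remainder of the biggest drive
]

for name, members, slice_gb in groups:
    for d in members:
        drives[d] -= slice_gb
        assert drives[d] >= 0, f"{d} oversubscribed in {name}"

print(drives)  # all zeros: every drive is fully consumed
```

Each of the four groups can then be surfaced as its own LUN or drive letter, as the post describes.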

But is this RAID?

I would say “yes”. Although this is at the subdrive level, it still looks like RAID storage, using parity and data blocks across stripes of data. All that’s been done is to take the unit of drive and make it some portion of a drive instead. Marketing aside, I think it’s an interesting concept and works well for a few drives of mixed capacity (just the market space Drobo is going after).

For larger concerns with intermixed drives I like parity declustering. It has the best of bigger RAID groups without the problems of increased activity for over-writes. Given today’s drive capacities, I might still lean towards a dual parity scheme with the parity declustering stripe but that doesn’t seem difficult to incorporate.

So when people ask if RAID’s days are numbered – my answer is a definite NO!

Latest SPECsfs2008 results – chart of the month

Top 10 SPEC(R) sfs2008 NFS throughput results as of 25Sep2009

The adjacent chart is from our September newsletter and shows the top 10 NFSv3 throughput results from the latest SPEC(R) sfs2008 benchmark runs published as of 25 September 2009.

There have been a number of announcements of newer SPECsfs2008 results in the news of late, namely Symantec's FileStore and Avere Systems' releases, but those results are not covered here. In this chart, the winner is the NetApp FAS6080 with FCAL disks behind it, clocking in at 120K NFSv3 operations/second. This was accomplished with 324 disk drives using two 10GbE links.

PAM comes out

All that’s interesting of course but what is even more interesting is NetApp’s results with their PAM II (Performance Accelerator Module) cards. The number 3, 4 and 5 results were all with the same system (FAS3160) with different configurations of disks and PAM II cards. Specifically,

  • The #3 result had a FAS3160, running 56 FCAL disks with PAM II cards and DRAM cache of 532GBs. The system attained 60.5K NFSv3 operations per second.
  • The #4 result had a FAS3160, running 224 FCAL disks with no PAM II cards but 20GB of DRAM cache. This system attained 60.4K NFSv3 ops/second.
  • The #5 result had a FAS3160, running 96 SATA disks with PAM II cards and DRAM cache of 532GBs. This system also attained 60.4K NFSv3 ops/second.

Similar results can be seen with the FAS3140 systems at #8, 9 and 10. In this case the FAS3140 systems were using PAM I (non-NAND) cards with 41GB of cache for results #9 and #10, while the #8 result had no PAM and only 9GB of cache. The #8 result used 224 FCAL disks, #9 used 112 FCAL disks, and #10 had 112 SATA disks. They were able to achieve 40.1K, 40.1K and 40.0K NFSv3 ops/second respectively.

I don't know how much PAM II cards cost versus FCAL or SATA disks, but there is an obvious trade-off here: you can use fewer FCAL or cheaper SATA disks and attain the same NFSv3 ops/second performance.

As I understand it, the PAM II cards come in 256GB configurations and you can have 1 or 2 cards in a FAS system configuration. PAM cards act as an extension of FAS system cache and all IO workloads can benefit from their performance.

As with all NAND flash, write access is significantly slower than read and NAND chip reliability has to be actively managed through wear leveling and other mechanisms to create a reliable storage environment. We assume either NetApp has implemented the appropriate logic to support reliable NAND storage or has purchased NAND cache with the logic already onboard. In any case, the reliability of NAND is more concerned with write activity than read and by managing the PAM cache to minimize writes, NAND reliability concerns could easily be avoided.

The full report on the latest SPECsfs2008 results will be up on my website later this week but if you want to get this information earlier and receive your own copy of our newsletter – email me at SubscribeNews@SilvertonConsulting.com?Subject=Subscribe_to_Newsletter.

Full disclosure: I currently have a contract with NetApp on another facet of their storage but it is not on PAM or NFSv3 performance.

Cache appliances rise from the dead

XcelaSAN picture from DataRam.com website
Sometime back in the late '80s, a company I once worked with had a product called the tape accelerator, which was nothing more than a RAM cache in front of a tape device to smooth out physical tape access. The tape accelerator was a popular product for its time, until most tape subsystems started incorporating their own cache to do this.

At SNW in Phoenix this week, I saw a couple of vendors that were touting similar products with a new twist. They had both RAM and SSD cache and were doing this for disk only. DataRAM’s XcelaSAN was one such product although apparently there were at least two others on the floor which I didn’t talk with.

XcelaSAN is targeted at midrange disk storage, where the storage subsystems have limited amounts of cache. Their product is Fibre Channel attached and lists for US$65K per subsystem. Two appliances can be paired together for high availability. Each appliance has eight 4GFC ports, with 128GB of DRAM and 360GB of SSD cache.

I talked to them a little about their caching algorithms. They claim to have sequential detect, lookahead and other sophisticated caching capabilities, but the proof is in the pudding. It would be great to put this in front of a currently SPC benchmarked storage subsystem and see how much it accelerates its SPC-1 or SPC-2 results, if at all.

From my view, this is yet another economic foot race. Most new mid range storage subsystems today ship with 8-16GB of DRAM cache and varied primitive caching algorithms. DataRAM’s appliance has considerably more cache but at these prices it would need to be amortized over a number of mid range subsystems to be justified.

Enterprise class storage subsystems have a lot of RAM cache already, but most use SSDs as a storage tier and not a cache tier (except for NetApp's PAM card). Also, we

  • Didn't talk much about the reliability of their NAND cache, or whether they were using SLC or MLC, but workloads these days are approaching 1:1 read:write ratios. IMHO, having some SSD in the system for heavy reads is good, but you need RAM for the heavy write workloads.
  • Also, what happens when the power fails is yet another interesting question to ask. Most subsystem caches have battery backup or non-volatile RAM sufficient to get data written to RAM out to some more permanent storage like disk. In these appliances, perhaps they just write it to SSD.
  • Also, what happens when the storage subsystem power fails but the appliance stays up? Sooner or later you have to go back to the storage to retrieve or write the data.

In my view, none of these issues are insurmountable, but they take clever code to get around. Knowing how clever their appliance developers are is hard to judge from the outside. Quality is often as much a factor of testing as it is a factor of development (see my Price of Quality post to learn more on this).

Also, most often caching algorithms are very tailored to the storage subsystem that surrounds it. But this isn’t always necessary. Take IBM SVC or HDS USP-V both of which can add a lot of cache in front of other storage subsystems. But these products also offer storage virtualization which the caching appliances do not provide.

All in all, I feel this is a good direction to take but it’s somewhat time limited until the midrange storage subsystems start becoming more cache intensive/knowledgeable. At that time these products will once again fall into the background. But in the meantime they can have a viable market benefit for the right storage environment.

Sidekick's failure, no backups

Sidekick 2 "Skinit" by grandtlairdjr (cc) (from flickr)

I believe I have covered this ground before but apparently it needs reiterating. Cloud storage without backup cannot be considered a viable solution. Replication only works well if you never delete or logically erase data from a primary copy. Once that’s done the data is also lost in all replica locations soon afterwards.

I am not sure what happened with the Sidekick data, whether somehow a finger check deleted it or some other problem occurred, but from what I can see looking in from the outside, there were no backups, no offline copies, no fallback copies of the data that weren't part of the central node and its network of replicas. When that's the case, disaster is sure to ensue.

At the moment the blame game is going around to find out who is responsible, and I hear that some of the data may be being restored. But that's not the point. Having no backups outside the original storage infrastructure/environment is the problem. Replicas are never enough. Backups have to be elsewhere to count as backups.

Had they had backups, the duration of the outage would have been the length of time it took to retrieve and restore the data, and some customer data created since the last backup would have been lost, but that would have been it. It wouldn't be the first time backups had to be used and it won't be the last. But without backups at all, you have a massive customer data loss that cannot be recovered from.

This is unacceptable. It gives IT a bad name, puts a dark cloud over cloud computing and storage, and makes the IT staff of Sidekick/Danger look bad or, worse, incompetent and naive.

All of you cloud providers need to take heed. You can do better. Backup software/services can be used to backup this data and we will all be better served because of it.

BBC and others now report that most of the Sidekick data will be restored. I am glad that they found a way to recover their “… data loss in the core database and the back up.” and have “… installed a more resilient back-up process” for their customer data.

Some are saying that the backups just weren’t accessible but until the whole story comes out I will withhold judgement. Just glad to have another potential data loss be prevented.

Symantec's FileStore

Data Storage Device by BinaryApe (cc) (from flickr)
Earlier this week Symantec GA'ed their Veritas FileStore software. This software is an outgrowth of the earlier Symantec Veritas Cluster File System and Storage Foundation software, which were combined with new frontend software to create scalable NAS storage.

FileStore is another scale-out, cluster file system (SO/CFS) implemented as a NAS head via software. The software runs on a hardened Linux OS and can run on any commodity x86 hardware. It can be configured with up to 16 nodes. Also, it currently supports any storage supported by Veritas Storage Foundation, which includes FC, iSCSI, and JBODs. Symantec claims FileStore has the broadest storage hardware compatibility list in the industry for a NAS head.

As a NAS head FileStore supports NFS, CIFS, HTTP, and FTP file services and can be configured to support anywhere from under a TB to over 2PB of file storage. Currently FileStore can support up to 200M files per file system, up to 100K file systems, and over 2PB of file storage.

FileStore nodes work in an Active-Active configuration. This means any node can fail and the other, active nodes will take over providing the failed node’s file services. Theoretically this means that in a 16 node system, 15 nodes could fail and the lone remaining node could continue to service file requests (of course performance would suffer considerably).

As part of a cluster file system, FileStore supports quick failover of active nodes, accomplished in under 20 seconds. In addition, FileStore supports asynchronous replication to other FileStore clusters to support DR and BC in the event of a data center outage.

One of the things FileStore brings to the table is that it runs standard Linux O/S services. This means other Symantec functionality can also be hosted on FileStore nodes. The first Symantec service to be co-hosted with FileStore functionality is NetBackup Advanced Client services. Such a service lets a FileStore node act as a media server for its own backup, cutting the network traffic required to do a backup considerably.

FileStore also supports storage tiering, whereby files can be demoted and promoted between storage tiers in the multi-volume file system. Also, Symantec Endpoint Protection can be hosted on a FileStore node, providing anti-virus protection completely onboard. Other Symantec capabilities will soon follow to add to those already available.

FileStore’s NFS performance

Regarding performance, Symantec has submitted a 12 node FileStore system for SPECsfs2008 NFS performance benchmark. I looked today to see if it was published yet and it’s not available but they claim to currently be the top performer for SPECsfs2008 NFS operations. I asked about CIFS and they said they had yet to submit one. Also they didn’t mention what the backend storage looked like for the benchmark, but one can assume it had lots of drives (look to the SPECsfs2008 report whenever it’s published to find out).

In their presentation they showed a chart depicting FileStore performance scalability. According to this chart, at 16 nodes, the actual NFS ops performance was 93% of theoretical NFS ops performance. In my view, scalability is great, but often as the number of nodes increases you hit diminishing marginal returns and the net performance improvement per node decreases. The fact that they were able to hit, at 16 nodes, 93% of a linear extrapolation of their 2 to 8 node NFS ops performance is pretty impressive. (I asked to show the chart but hadn't heard back by post time.)

Pricing and market space

At the low end, FileStore is meant to compete with Windows Storage Server and would seem to provide better performance and availability than Windows. At the high end, I am not sure, but the competition would be HP/PolyServe and standalone NAS heads from EMC, NetApp/IBM and others. List pricing is about US$7K/node, so that top performing SPECsfs2008 12-node system would set you back about $84K for the software alone (please note that list pricing <> street pricing). You would need to add node hardware and storage hardware to provide a true apples-to-apples pricing comparison with other NAS storage.

As far as current customers, they range from high end (>1PB) e-retailers and SAAS providers (via Symantec's SAAS offering) to low end (<10TB) universities and hospitals. FileStore, with its inherent scalability and ability to host Symantec storage applications on the storage nodes, can offer a viable solution to many hard file system problems.

We have discussed scale-out and cluster file systems (SO/CFS) in a prior post (Why SO/CFS, Why Now) so I won’t elaborate on why they are so popular today. But, suffice it to say Cloud and SAAS will need SO/CFS to be viable solutions and everybody is responding to supply that market as it emerges.

Full disclosure: I currently have no active or pending contracts with Symantec.