Tape vs. Disk, the saga continues

Inside a (Spectra Logic) T950 library by ChrisDag (cc) (from Flickr)

Was on a call late last month where Oracle introduced their latest generation T10000C tape system (media and drive) holding 5TB native (uncompressed) capacity. In the last 6 months I have been hearing about the coming of a 3TB SATA disk drive from Hitachi GST and others. And last month, EMC announced a new Data Domain Archiver, a disk-only archive appliance (see my post on EMC Data Domain products enter the archive market).

Oracle assures me that tape density is keeping up with, if not gaining on, disk density trends and capacity. But density and capacity aren't the only issues causing data to move off of tape in today's enterprise data centers.

“Dedupe Rulz”

A problem with the data density trends discussion is that it's one dimensional (well, literally it's two dimensional). With data compression, disk or tape systems can easily double the density on a piece of media. But with data deduplication, the multiples start becoming more like 5X to 30X depending on the frequency of full backups or duplicated data. And numbers like those dwarf any discussion of density ratios and, as such, get everyone's attention.
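
To see why those multiples swamp any density argument, here's some back-of-the-envelope arithmetic. The ratios below are illustrative, not measurements of any particular product.

```python
# Back-of-the-envelope only: effective capacity of a 5TB native cartridge
# under 2x compression vs. 5x-30x deduplication. Ratios are illustrative,
# not measurements of any particular product.

native_tb = 5.0

for label, ratio in [("compression", 2), ("dedupe, low end", 5), ("dedupe, high end", 30)]:
    print(f"{label:18s} {ratio:2d}x -> {native_tb * ratio:6.1f} TB effective")
```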

I can remember talking to an avowed tape engineer years ago who described deduplication technology at the VTL level as architecturally impure and inefficient. From his perspective it needed to be done much earlier in the data flow. But what he failed to see was the ability of VTL deduplication to be plug-compatible with the tape systems of that time. Such ease of adoption allowed deduplication systems to build a beachhead and economies of scale. From there such systems have now been able to move upstream, into earlier stages of the backup data flow.

Nowadays, with Avamar, Symantec PureDisk and others, source-level deduplication, or deduplication close to the source, is a reality. But all this came about because they were able to offer 30X the density on a piece of backup storage.

Tape’s next step

Tape could easily fight back. All that would be needed is some system in front of a tape library that provided deduplication capabilities not just for the disk media but for the tape media as well. This way the 30X density advantage over non-deduplicated storage could follow through all the way to the tape media.

In the past, this made little sense because restoring a particular set of data from deduplicated tape could require mounting multiple volumes. However, with today's 5TB tapes, maybe that doesn't have to be the case anymore. In addition, a deduplication system in front of the tape library could handle most of the immediate restore activity, while restoring data from tape would be more like pulling something out of an archive and, as such, might take longer to perform. In any event, with LTO's media partitioning and enterprise class tapes supporting multiple domains, creating a structure with a metadata partition and a data partition is easier than ever.
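
To make the idea concrete, here's a minimal sketch (names and layout are mine, not any vendor's tape API) of how a dedupe front-end might use a two-partition cartridge: unique chunks get appended to the data partition, while a small fingerprint index plus per-backup recipes would land in the metadata partition, so a restore reads the index first and then seeks only to the chunks it needs.

```python
import hashlib
import io
import os

# Minimal sketch, not any vendor's implementation: fingerprint fixed-size
# chunks, append only unique chunks to a "data partition", and keep the
# fingerprint index plus per-backup recipes for a "metadata partition".
# Partitions are modeled as in-memory buffers; real systems use
# content-defined chunking and write through the tape driver.

CHUNK_SIZE = 4096
data_partition = io.BytesIO()              # stand-in for the tape data partition
index = {"chunks": {}, "recipes": {}}      # would be serialized to the metadata partition

def write_backup(name, stream):
    data_partition.seek(0, io.SEEK_END)    # always append new chunks at the end
    recipe = []
    for off in range(0, len(stream), CHUNK_SIZE):
        chunk = stream[off:off + CHUNK_SIZE]
        fp = hashlib.sha256(chunk).hexdigest()
        if fp not in index["chunks"]:      # only unique chunks consume tape
            pos = data_partition.tell()
            data_partition.write(chunk)
            index["chunks"][fp] = (pos, len(chunk))
        recipe.append(fp)
    index["recipes"][name] = recipe

def restore_backup(name):
    # A restore needs only the small metadata partition plus the chunks the
    # recipe points at, not a sequential pass over the whole cartridge.
    out = bytearray()
    for fp in index["recipes"][name]:
        pos, length = index["chunks"][fp]
        data_partition.seek(pos)
        out += data_partition.read(length)
    return bytes(out)

# Two identical "full backups": the second adds a recipe but no new chunks.
full = os.urandom(1_000_000)
write_backup("monday_full", full)
write_backup("sunday_full", full)
assert restore_backup("sunday_full") == full
print(len(index["chunks"]), "unique chunks on tape for 2 full backups")
```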

“Got Dedupe”

There are plenty of places where today's tape vendors can obtain deduplication capabilities. Permabit offers dedupe code for OEM applications for those that have no dedupe systems today. FalconStor, Sepaton and others offer deduplication systems that can be OEMed. IBM, HP and Quantum already have both tape libraries and their own dedupe systems, all of which could readily support a deduplicating front-end to their tape libraries, if they don't already.

Where “Tape Rulz”

There are places where data deduplication doesn't work very well today, mainly rich media, physics, biopharm and other non-compressible big-data applications. For these situations, tape still has a home, but for the rest of the data center world, deduplication is taking over, if it hasn't already. The sooner tape gets on the deduplication bandwagon the better for the IT industry.

—-

Of course there are other problems hurting tape today. I know of at least one large conglomerate that has moved all backup off tape altogether, even data which doesn't deduplicate well (see my previous Oracle RMAN posts), and at least one other rich media conglomerate that is considering the very same move. For now, tape has a safe harbor in big science, but it won't last long.

Comments?

What’s wrong with tape?

StorageTek Automated Cartridge System by brewbooks (cc) (from Flickr)

Was on a conference call today with Oracle's marketing discussing their tape business. Fred Moore (of Horison Information Strategies) was on the call and mentioned something which surprised me: what's missing in open and distributed systems is some standalone mechanism to stack multiple volumes onto a single tape cartridge.

The advantages of tape are significant, namely:

  • Low power utilization for offline or nearline storage
  • Cheap media, drives, and automation systems
  • Good sequential throughput
  • Good cartridge density

But most of these advantages fade when cartridge capacity utilization drops.  One way to increase cartridge capacity utilization is to stack multiple tape volumes on a single cartridge.
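
As a toy illustration of what volume stacking buys you, the sketch below first-fit packs a set of small logical volumes onto 1TB cartridges. All sizes are made-up numbers.

```python
# Toy illustration of volume stacking: first-fit pack small logical volumes
# onto 1TB cartridges so cartridge capacity utilization stays high.
# All sizes are made-up numbers.

CARTRIDGE_GB = 1000                            # e.g. a 1TB T10000 cartridge

def stack(volume_sizes_gb):
    cartridges = []                            # each entry is a list of volume sizes
    for vol in volume_sizes_gb:
        for cart in cartridges:
            if sum(cart) + vol <= CARTRIDGE_GB:
                cart.append(vol)
                break
        else:                                  # no existing cartridge had room
            cartridges.append([vol])
    return cartridges

for i, cart in enumerate(stack([120, 75, 300, 40, 650, 90, 210, 35]), 1):
    print(f"cartridge {i}: {sum(cart)} GB used ({sum(cart) / CARTRIDGE_GB:.0%})")
```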

Mainframes (like System z) have had cartridge stacking since the late '90s. Such capabilities came about due to the increasing cartridge capacities then available. Advance a decade and the problem still exists: Oracle's StorageTek T10000 has a 1TB cartridge capacity and LTO-5 supports 1.5TB per cartridge, both uncompressed. Nonetheless, open and distributed systems still have no tape stacking capability.

Although I agree with Fred that volume stacking is missing in open systems, do they really need such a thing? Currently it seems open systems use tape for backups, archive data and the occasional batch run. Automated hierarchical storage management can readily fill up tape cartridges by holding data movement to tape until enough data is ready to be moved. Backups, by their very nature, create large sequential streams of data which should result in high capacity utilization, except for the last tape in a series. Which only leaves the problem of occasional batch runs using large datasets or files.
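
The HSM behavior described above amounts to something like the following sketch (class and threshold names are mine, not any product's API): migration candidates sit in a staging pool and are flushed to tape only once a full cartridge's worth has accumulated.

```python
# Sketch of HSM-style migration batching: hold candidates in a staging pool
# and flush to tape only once a full cartridge's worth has accumulated, so
# each cartridge written is close to fully utilized. Names and thresholds
# are illustrative, not any product's API.

class TapeMigrator:
    def __init__(self, cartridge_gb=1500):     # e.g. an LTO-5 cartridge, uncompressed
        self.cartridge_gb = cartridge_gb
        self.staged = []                       # (file_name, size_gb) awaiting migration

    def queue(self, name, size_gb):
        self.staged.append((name, size_gb))
        if sum(size for _, size in self.staged) >= self.cartridge_gb:
            self.flush()

    def flush(self):
        total = sum(size for _, size in self.staged)
        print(f"writing {len(self.staged)} files ({total} GB) to one cartridge")
        self.staged = []

hsm = TapeMigrator()
for i in range(40):
    hsm.queue(f"dataset{i:02d}", 100)          # flushes roughly every 15 files
```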

I believe most batch processing today already takes place on the mainframe, leaving relatively little for open or distributed systems. There are certainly some verticals that do lots of batch processing, for example banks and telcos. But most heavy batch users grew up in the heyday of the mainframe and are still using it today.

Condor notwithstanding, open and distributed systems never had the sophisticated batch processing capabilities readily available on the mainframe. As such, my guess is that new companies needing batch processing start on open systems and, as their batch needs grow, move those applications to the mainframe.

So the real question becomes: how do we increase open systems batch processing? I don't think a tape volume stacking system solves that problem.

Given all the above, I see tape use in open systems being relegated to backup and archive and used less and less for any other activities.

What do you think?