e-pathology and data growth

Blue nevus (4 of 4) by euthman (cc) (From Flickr)

I was talking with another analyst the other day by the name of John Koller of Kai Consulting who specializes in the medical space and he was talking about the rise of electronic pathology (e-pathology).  I hadn’t heard about this one.

He said that just like radiology had done in the recent past, pathology investigations are moving to make use of digital formats.

What does that mean?

The biopsies taken today for cancer and disease diagnosis, which involve one or more specimens of tissue examined under a microscope, will now be digitized, and the digital files will be inspected instead of the original slide.

Apparently microscopic examinations typically use a 1×3 inch slide that can have the whole slide devoted to some tissue matter.  To be able to do a pathological examination, one has to digitize the whole slide, under magnification, at various depths within the tissue.  According to Koller, any tissue is essentially a 3D structure and pathological exams must inspect different depths (slices) within this sample to form their diagnosis.

I was struck by the need for different slices of the same specimen. I hadn’t anticipated that but whenever I look in a microscope, I am always adjusting the focal length, showing different depths within the slide.   So it makes sense, if you want to understand the pathology of a tissue sample, multiple views (or slices) at different depths are a necessity.

So what does a slide take in storage capacity?

Koller said, an uncompressed, full slide will take about 300GB of space. However, with compression and the fact that most often the slide is not completely used, a more typical space consumption would be on the order of 3 to 5GB per specimen.

As for volume, Koller indicated that a medium hospital facility (~300 beds) typically does around 30K radiological studies a year but does about 10X that in pathological studies.  So at 300K pathological examinations a year and 3 to 5GB each, we are talking about 900TB to 1.5PB of digitized specimen images a year for a mid-sized hospital.
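A quick back-of-the-envelope check of that volume, using only the figures Koller quoted (30K radiological studies, 10X that in pathology, 3 to 5GB per specimen). The function and variable names are mine, and decimal units (1TB = 1000GB) are assumed.

```python
# Back-of-the-envelope estimate; every input is a figure quoted in the post.
RADIOLOGY_STUDIES_PER_YEAR = 30_000
PATHOLOGY_MULTIPLIER = 10          # pathology runs about 10X radiology
GB_PER_SPECIMEN_RANGE = (3, 5)     # typical compressed size per specimen


def yearly_storage_tb(studies_per_year, gb_per_specimen):
    """Yearly storage in decimal terabytes (assumes 1 TB = 1000 GB)."""
    return studies_per_year * gb_per_specimen / 1000


pathology_studies = RADIOLOGY_STUDIES_PER_YEAR * PATHOLOGY_MULTIPLIER
low_tb, high_tb = (
    yearly_storage_tb(pathology_studies, gb) for gb in GB_PER_SPECIMEN_RANGE
)

print(f"{pathology_studies:,} exams/year")
print(f"{low_tb:,.0f} TB to {high_tb:,.0f} TB of images per year")
```

Change the per-specimen size or the study counts to fit your own facility; the uncompressed 300GB figure would push the result up by roughly two orders of magnitude.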

Why move to  e-pathology?

It can open up a whole myriad of telemedicine offerings similar to the radiological study services currently available around the globe.  Today, non-electronic pathology involves sending specimens off to a local lab for examination by medical technicians under a microscope.  But with e-pathology, the specimen gets digitized (whether at the hospital or at the lab is unclear to me) and then the digital files can be sent anywhere around the world, wherever someone is qualified and available to scrutinize them.


At a recent analyst event we were discussing big data and, aside from the analytics component and other markets, the vendor mentioned that content archives are starting to explode.  Given where e-pathology is heading, I can understand why.

It’s great to be in the storage business

Coming data bubble or explosion?

World population by Arenamontanus (cc) (from Flickr)

I was at another conference the other day where someone showed a chart that said the world will create 35ZB (10**21) of data and content in 2020 from 800EB (10**18) in 2009.

Every time I see something like this I cringe.   Yes, lots of data is being created today but what does that tell us about corporate data growth?  Not much, I’d wager.

Data bubble

That being said, I have a couple of questions I would ask of the people who estimated this:

  • How much is personal data and how much is corporate data?
  • Did you factor in how entertainment data growth rates will change over time?

These two questions are crucial.

Entertainment dominates data growth

Just as personal entertainment is becoming the major consumer of national bandwidth (see study [requires login]), it’s clear to me that the majority of the data being created today is for personal consumption/entertainment – video, music, and image files.

I look at my own office: our corporate data (office files, PDFs, text, etc.) represents ~14% of the data we keep.  Images, music, video, and audio take up the remainder of our data footprint.  Is this data growing? Yes, faster than I would like, but the corporate data is only averaging ~30% YoY growth while the overall data growth for our shop is averaging ~116% YoY. [As I interrupt this activity to load up another 3.3GB of photos and videos from our camera]

Moreover, although some media content is of significant external interest to select companies today (Media and Entertainment, social media photo/video sharing sites, mapping/satellite, healthcare, etc.), most corporations don’t deal with lots of video, music or audio data.  Thus, I personally see 30% as a more realistic growth rate for corporate data than 116%.

Will entertainment data growth flatten?

Will we see a drop in the entertainment data growth rates over time? Undoubtedly.

Two factors will reduce the growth of this data.

  1. What happens to entertainment data recording formats?  I believe media recording formats are starting to level out.  I think the issue here is one of fidelity to nature, in terms of how closely a digital representation matches reality as we perceive it.  For example, most digital projection systems in movie theaters today run from ~2 to 8TB per feature length motion picture, which seems to indicate that at some point further gains in fidelity (or in more pixels/frame) may not be worth it.  Similar issues will ultimately lead to a slowing down of other media encoding formats.
  2. When will all the people that can create content be doing so? Recent data indicates that more than 2B people will be on the internet this year, or ~28% of the world’s population.  But at some point we must reach saturation on internet penetration and when that happens data growth rates should also start to level out.  Let’s say for argument’s sake that 800EB in 2009 was correct and let’s assume there were 1.5B internet users (in 2009).  As such, 1B internet users correlates to a data and content footprint of about 533EB, or ~0.5TB/internet user, which seems high but certainly doable.
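The per-user footprint in the second point is simple division, sketched here for anyone who wants to plug in a different user count. The 1.5B users figure is the assumption stated above, not a measured number, and decimal units are assumed.

```python
EB = 10**18  # exabyte, decimal
GB = 10**9   # gigabyte, decimal

world_data_2009_bytes = 800 * EB   # estimate quoted from the conference chart
internet_users_2009 = 1.5e9        # assumption made in the text above

bytes_per_user = world_data_2009_bytes / internet_users_2009
eb_per_billion_users = bytes_per_user * 1e9 / EB

print(f"{bytes_per_user / GB:.0f} GB per internet user")
print(f"{eb_per_billion_users:.0f} EB per 1B internet users")
```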

Once these two factors level off, we should see world data and content growth rates plummet.  Nonetheless, internet user population growth could be driving data growth rates for some time to come.

Data explosion

The scary part is that the 35ZB represents only a ~41% compound annual growth rate over the period against the baseline 2009 data and content creation levels.
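For those who want to check the ~41% figure, here is a small sketch that treats it as a compound annual growth rate between the 2009 and 2020 numbers from the chart. The helper name is mine.

```python
def compound_annual_growth_rate(start, end, years):
    """Constant yearly growth rate that turns `start` into `end` over `years`."""
    return (end / start) ** (1 / years) - 1


start_eb = 800        # 2009 estimate, in EB
end_eb = 35_000       # 2020 estimate: 35ZB expressed in EB
years = 2020 - 2009

cagr = compound_annual_growth_rate(start_eb, end_eb, years)
print(f"Implied growth: {cagr:.1%} per year over {years} years")
```

Swapping in a higher 2020 estimate shows how quickly the implied yearly rate climbs once more content creators are assumed.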

But I must assume this estimate doesn’t consider much growth in digital creators of content, otherwise these numbers should go up substantially.   In the last week, I ran across someone who said there would be 6B internet users by the end of the decade (can’t seem to recall where, but it was a TEDx video).  I find that a little hard to believe but this was based on the assumption that most people will have smart phones with cellular data plans by that time.  If that be the case, 35ZB seems awfully short of  the mark.

A previous post blows this discussion completely away with just one application (see Yottabytes by 2015 for the NSA; a Yottabyte (YB) is 10**24 bytes of data), and I had already discussed an Exabyte-a-day and 3.3 Exabytes-a-day in prior posts.  [Note, those YB by 2015 are all audio (phone) recordings but if we start using Skype Video, FaceTime and other video communications technologies can Nonabytes (10**27) be far behind… BOOM!]


I started out thinking that 35ZB by 2020 wasn’t pertinent to corporate considerations and figured things had to flatten out, then convinced myself that it wasn’t large enough to accommodate internet user growth, and then finally recalled prior posts that put all this into even more perspective.


EMCWorld news Day1 1st half

EMC World keynote stage, storage, vblocks, and cloud...

EMC announced today a couple of new twists on the flash/SSD storage end of the product spectrum.  Specifically,

  • They now support all flash/no-disk storage systems. Apparently they have been getting requests to eliminate disk storage altogether. Probably government IT but maybe some high-end enterprise customers with low-power, high performance requirements.
  • They are going to roll out enterprise MLC flash.  It’s unclear when it will  be released but it’s coming soon, different price curve, different longevity (maybe), but brings down the cost of flash by ~2X.
  • EMC is going to start selling server-side Flash, using storage FAST-like caching algorithms to knit the storage to the server-side Flash.  Unclear what server Flash they will be using but it sounds a lot like a Fusion-IO type of product.  How well the server cache and the storage cache talk to each other is another matter.  Chuck Hollis said EMC decided to redraw the boundary between storage and server and now there is a dotted line that spans the SAN/NAS boundary and carves out a piece of the server for what is, in effect, on-server caching.

Interesting to say the least.  How well it’s tied to the rest of the FAST suite is critical. What happens when one or the other loses power? As Flash is non-volatile no data would be lost, but the currency of the data for shared storage may be another question.  Also, having multiple servers in the environment may require cache coherence across the servers and storage participating in this data network!?

Some teaser announcements from Joe’s keynote:

  • VPLEX asynchronous, active-active support, giving two datacenters access to the same data over 1700Km apart (Pittsburgh to Dallas).
  • New Isilon record scalability and capacity with the NL appliance. It can now support a 15PB file system with trillions of files in it.  One gene sequencer says a typical assay generates 500M objects/files…
  • Embracing Hadoop open source products so that EMC will support a Hadoop distro in an appliance or software-only solution.

Pat G also showed an EMC Greenplum appliance searching an 8B row database to find out how many products have been shipped to a specific zip code…



Initial impressions on Spring SNW/Santa Clara

I heard storage beers last nite was quite the party, sorry I couldn’t make it but I did end up at the HDS customer reception which was standing room only and provided all the food and drink I could consume.

Saw quite a lot of old friends too numerous to mention here but they know who they are.

As for technology on display there was some pretty impressive stuff.

Virident card (c) 2011 Silverton Consulting, Inc.

Lots of great technology on display there.

Virident tachIOn SSD

One product that caught my eye was from Virident, their tachIOn SSD. I called it a storage subsystem on a board.  I had never talked with them before but they have been around for a while using NOR storage but now are focused on NAND.

Their product is a fully RAIDed storage device using flash aware RAID 5 parity locations, their own wear leveling and other SSD control software and logic with replaceable NAND modules.

Playing with this device I felt like I was swapping drives of the future. Each NAND module stack has a separate controller and supports high parallelism.  Shridar Subramanian, VP of marketing, said the product is capable of over 200K IOPS running a 70% read:30% write workload at full capacity.

They have a capacitor-backed DRAM buffer whose contents can be flushed to NAND after a power failure. It plugs into a PCIe slot and uses less than 25W of power, in capacities of 300-800GB.  It requires a software driver; they currently only support Linux and VMware (a Linux variant) but Windows and other O/Ss are on the way.

Other SSDs/NAND storage

Their story was a familiar refrain throughout the floor: lots of SSD/NAND technology coming out, in various form factors.  I saw one system using SSDs from Viking Modular Systems that fit into a DRAM DIMM slot and supported a number of SSDs behind a SAS-like controller, also requiring a SW driver.

(c) 2011 Silverton Consulting, Inc.

Of course TMS, Fusion-IO, Micron, Pliant and others were touting their latest SSD/NAND-based solutions and technology.   For some reason there were lots of SSDs at this show.

Naturally, all the other storage vendors were there: Dell, HDS, HP, EMC, NetApp and IBM. IBM was showing off Watson, their new AI engine that won at Jeopardy.

And then there was cloud, …

Cloud was a hot topic as well. Saw one vendor in the corner I have talked about before, StorSimple, which is a cloud gateway provider.  They said they are starting to see some traction in the enterprise. Apparently enterprises are starting to adopt cloud – who knew?

Throw in a few storage caching devices, …

Then of course there were the data caching products, which ranged from the relaunched Dataram XcelaSAN to Marvell’s new DragonFly card.  DragonFly provides a cache on a PCIe card while the Dataram product is an FC caching appliance, all pretty interesting.

… and what’s organic storage?

And finally, Scality came out of the shadows with what they are calling an organic object storage device.  The product reminded me of Bycast (now with NetApp) and Archivas (now with HDS) in that they had a RAIN architecture, with mirrored data behind an object store interface.  I asked them what makes them different and Jerome Lecat, CEO, said they are relentlessly focused on performance and claimed they can retrieve an object in under 40msec.  My kind of product.  I think they deserve a deeper dive sometime later.


Probably missed a few other vendors but these are my initial impressions.  For some reason I felt right at home swapping NAND drive modules,…



SSD market dynamics

Toshiba's 2.5" SSD (from SSD.Toshiba.com)

Had a talk the other week with a storage executive about SSD and NAND cost trends.  It seemed that everyone thought that $/GB for SSD was going to undercut (become less costly than) enterprise class disk sometime in 2013.  But it appeared that NAND costs weren’t coming down as fast as anticipated and now this was going to take longer than expected.

A couple of other things are going on in the enterprise disk market that are also having an effect on the relative advantage of SSDs over disks.  Probably most concerning to the SSD market is enterprise storage’s new penchant for sub-LUN tiering.

Automated sub-LUN storage tiering

The major storage vendors all currently support some form of automated storage tiering for SSD storage (NetApp’s Flash Cache does this differently but the impact on NAND storage requirements is arguably similar).  Presumably, such tiering should take better advantage of any amount of SSD/NAND storage available to a storage system.

Prior to automated sub-LUN storage tiering, one had to move a whole LUN to SSDs to take advantage of its speed. However, I/O requests or accesses are not necessarily at the same intensity for all blocks of a LUN.  So one would typically end up with an SSD LUN with relatively few blocks being heavily accessed while the vast majority of its blocks were not being hit that much.  We paid the high price of SSD LUNs gladly to get the high performance for those few blocks that really needed it.

However, with sub-LUN tiering or NAND caching, one no longer has to move all the blocks of a LUN into NAND storage to gain its benefits.  One can now just have the system identify those select blocks which need high performance and move those blocks and those blocks only to NAND storage.  The net impact of sub-LUN tiering or NAND caching is that one should require less overall NAND storage to obtain the same performance as one had previously with SSDs alone.
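To make the idea concrete, here is a toy sketch of picking the hot blocks of a LUN. It assumes the simplest possible policy, raw access counts over a trace, which is my own stand-in; the vendors’ actual tiering and caching heuristics are not described here and are surely more elaborate. The function name and the trace are hypothetical.

```python
from collections import Counter


def pick_hot_blocks(io_trace, nand_capacity_blocks):
    """Return the block addresses that would be placed on NAND.

    io_trace: iterable of block addresses, one entry per I/O request.
    nand_capacity_blocks: number of blocks that fit in the NAND tier.
    Policy (an assumption for illustration): most frequently accessed first.
    """
    access_counts = Counter(io_trace)
    hottest = access_counts.most_common(nand_capacity_blocks)
    return {block for block, _count in hottest}


# Toy trace for one LUN: blocks 7 and 42 are hot, blocks 1, 2, 3 are touched once.
trace = [7, 42, 7, 1, 42, 7, 2, 42, 7, 3, 42, 7]
hot = pick_hot_blocks(trace, nand_capacity_blocks=2)
hits = sum(1 for block in trace if block in hot)

print(f"Blocks on NAND: {sorted(hot)}")
print(f"{hits} of {len(trace)} I/Os served from NAND")
```

In this toy case two of the five blocks on NAND absorb three quarters of the I/O, which is the whole argument for needing less NAND than a full SSD LUN.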

On the other hand, some would say that making the performance advantages of NAND be available at a lower overall cost might actually increase the overall amount of NAND shipments. Also with automated sub-LUN tiering in place, this removes all the complexity needed previously to identify which LUNs needed higher performance.  Reducing such complexity should increase SSD or NAND market penetration.

Nonetheless, I feel that given today’s price differential of SSDs over enterprise disk, the people buying SSDs today have a very defined need for speed and would have paid the price anyway for SSD storage.  Anything we do to satisfy that need with less SSD or NAND storage should reduce the amount of SSDs shipped today.

But getting back to that price crossover point, as the relative price of NAND on $/GB comes down, having an easy way to take advantage of  its better performance should increase its market adoption, even faster than price would do alone.


When will disks become extinct?

A head assembly on a Seagate disk drive by Robert Scoble (cc) (from flickr)

Yesterday, it was announced that Hitachi Global Storage Technologies (HGST) is being sold to Western Digital for $4.3B and after that there was much discussion in the tweeterverse about the end of enterprise disk as we know it.  Also, last week I was at a dinner at an analyst meeting with Hitachi, where the conversation turned to when disks will no longer be available. This discussion was between Mr. Takashi Oeda of Hitachi RSD, Mr. John Webster of Evaluator group and myself.

Why SSDs will replace disks

John was of the opinion that disks would stop being economically viable in about 5 years’ time and will no longer be shipping in volume, mainly due to energy costs.  Oeda-san said that Hitachi had predicted that NAND pricing on a $/GB basis would cross over (become less expensive than) 15Krpm disk pricing sometime around 2013.  Later he said that NAND pricing had not come down as fast as projected and that it was going to take longer than anticipated.  Note that Oeda-san mentioned density price cross over for only 15Krpm disk, not 7200rpm disk.  In all honesty, he said SATA disk would take longer, but he did not predict when.

I think both arguments are flawed:

  • Energy costs for disk drives drop on a Watts/GB basis every time disk density increases. So the energy it takes to run a 600GB drive today will likely be able to run a 1.2TB drive tomorrow.  I don’t think energy costs are going to be the main factor that drives disks out of the enterprise.
  • Density costs for NAND storage are certainly declining but cost/GB is not the only factor in technology adoption. Disk storage has cost more than tape capacity since the ’50s, yet they continue to coexist in the enterprise. I contend that disks will remain viable for at least the next 15-20 years over SSDs, primarily because disks have unique functional advantages which are vital to enterprise storage.

Most analysts would say I am wrong, but I disagree. I believe disks will continue to play an important role in the storage hierarchy of future enterprise data centers.

NAND/SSD flaws from an enterprise storage perspective

All costs aside, NAND based SSDs have serious disadvantages when it comes to:

  • Data retention – the problem with NAND data cells is that they can only be written so many times before they fail.  And as NAND cells become smaller, this rate seems to be going the wrong way, i.e., today’s NAND technology can support 100K writes before failure but tomorrow’s NAND technology may only support 15K writes before failure.  This is not a beneficial trend if one is going to depend on NAND technology for the storage of tomorrow.
  • Sequential access – although NAND SSDs perform much better than disk when it comes to random reads and less so, random writes, the performance advantage of sequential access is not that dramatic.  NAND sequential access can be sped up by deploying multiple parallel channels but it starts looking like internal forms of wide striping across multiple disk drives.
  • Unbalanced performance – with NAND technology, reads operate quicker than writes. Sometimes 10X faster.  Such unbalanced performance can make dealing with this technology more difficult and less advantageous than disk drives of today with much more balanced performance.

None of these problems will halt SSD use in the enterprise. They can all be dealt with through more complexity in the SSD or in the storage controller managing the SSDs, e.g., wear leveling to try to prolong data retention, multi-data channels for sequential access, etc. But all this additional complexity increases SSD cost, and time to market.

SSD vendors would respond with yes it’s more complex, but such complexity is a one time charge, mostly a one time delay, and once done, incremental costs are minimal. And when you come down to it, today’s disk drives are not that simple either with defect skipping, fault handling, etc.

So why won’t disk drives go away soon?  I think another major concern in NAND/SSD ascendancy is the fact that the bulk NAND market is moving away from SLC (single level cell or bit/cell) NAND to MLC (multi-level cell) NAND due to its cost advantage.  When SLC NAND is no longer the main technology being manufactured, its price will not drop as fast and its availability will become more limited.

Some vendors also counter this trend by incorporating MLC technology into enterprise SSDs. However, all the problems discussed earlier become an order of magnitude more severe with MLC NAND. For example, rather than 100K write operations to failure with SLC NAND today, it’s more like 10K write operations to failure on current MLC NAND.  The fact that you get 2 to 3 times more storage per cell with MLC doesn’t help that much when one gets 10X fewer writes per cell. And the next generation of MLC is 10X worse, maybe getting on the order of 1000 writes/cell prior to failure.  Similar issues occur for write performance: MLC writes are much slower than SLC writes.
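One rough way to see why the extra bits per cell don’t make up for the lost write cycles is to multiply the two and compare lifetime bits written per cell. The sketch below uses the cycle counts quoted above; the bits-per-cell values (2 for current MLC, 3 for the next generation) are my assumptions, picked from the "2 to 3 times" range, and the helper name is hypothetical.

```python
def lifetime_bits_per_cell(bits_per_cell, write_cycles):
    """Total bits one cell can absorb before wearing out (ignores wear leveling)."""
    return bits_per_cell * write_cycles


slc = lifetime_bits_per_cell(bits_per_cell=1, write_cycles=100_000)
mlc = lifetime_bits_per_cell(bits_per_cell=2, write_cycles=10_000)      # 2 bits assumed
next_mlc = lifetime_bits_per_cell(bits_per_cell=3, write_cycles=1_000)  # 3 bits assumed

for name, bits in [("SLC", slc), ("MLC", mlc), ("next-gen MLC", next_mlc)]:
    print(f"{name}: {bits:,} lifetime bits per cell ({slc / bits:.1f}X less than SLC)")
```

This ignores over-provisioning, wear leveling and write amplification, all of which change the absolute numbers but not the direction of the comparison.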

So yes, raw NAND may become cheaper than 15Krpm Disks on a $/GB basis someday but the complexity to deal with such technology is also going up at an alarming rate.

Why disks will persist

Now something similar can be said for disk density, what with the transition to thermally assisted recording heads/media and the rise of bit-patterned media, all of which are making disk drives more complex with each generation that comes out.  So what allows disks to persist long after $/GB is cheaper for NAND than disk?

  • Current infrastructure supports disk technology well in enterprise storage. Disks have been around so long, that storage controllers and server applications have all been designed around them.  This legacy provides an advantage that will be difficult and time consuming to overcome. All this will delay NAND/SSD adoption in the enterprise for some time, at least until this infrastructural bias towards disk is neutralized.
  • Disk technology is not standing still.  It’s essentially a race to see who will win the next generation’s storage.  There is enough of an eco-system around disk that will keep pushing media, heads and mechanisms ever forward into higher densities, better throughput, and more economical storage.

However, any infrastructural advantage can be overcome in time.  What will make this go away even quicker is the existence of a significant advantage over current disk technology in one or more dimensions. Cheaper and faster storage can make this a reality.

Moreover, as for the ecosystem discussion, arguably the NAND ecosystem is even larger than disk.  I don’t have the figures but if one includes SSD drive producers as well as NAND semiconductor manufacturers the amount of capital investment in R&D is at least the size of disk technology if not orders of magnitude larger.

Disks will go extinct someday

So will disks become extinct? Yes, someday, undoubtedly, but when is harder to nail down. Earlier in my career there was talk of a super-paramagnetic effect that would limit how much data could be stored on a disk. Advances in heads and media moved that limit out of the way. However, there will come a time where it becomes impossible (or more likely too expensive) to increase magnetic recording density.

I was at a meeting a few years back where a magnetic head researcher predicted that such an end point to disk density increase would come in 25 years time for disk and 30 years for tape.  When this occurs disk density increase will stand still and then it’s a certainty that some other technology will take over.  Because as we all know data storage requirements will never stop increasing.

I think the other major unknown is other, non-NAND semiconductor storage technologies still under research.  They have the potential for  unlimited data retention, balanced performance and sequential performance orders of magnitude faster than disk and can become a much more functional equivalent of disk storage.  Such technologies are not commercially available today in sufficient densities and cost to even threaten NAND let alone disk devices.


So when do disks go extinct?  I would say in 15 to 20 years’ time we may see the last disks in enterprise storage.  That would give disks almost an 80 year dominance over storage technology.

But in any event I don’t see disks going away anytime soon in enterprise storage.


Information commerce – part 2

3d personal printer by juhansonin (cc) (from Flickr)

I wrote a post a while back about how interplanetary commerce could be stimulated through the use of information commerce (see my Information based inter-planetary commerce post).  Last week I saw an article in the Economist magazine that discussed new 3D-printers used to create products with just the design information needed to describe a part or product.  Although this is only one type of information commerce, cultivating such capabilities can be one step to the future information commerce I envisioned.

3D Printers Today

3D printers grew up from the 2D inkjet printers of last century.  It turns out if 2D printers can precisely spray ink on a surface it stands to reason that similar technology could potentially build up a 3D structure one plane at a time.  After each layer is created, a laser, infrared light or some other technique is used to set the material into its proper form and then the part is incrementally lowered so that the next layer can be created.

Such devices use a form of additive manufacturing which adds material to the exact design specifications necessary to create one part. In contrast, normal part manufacturing activities such as those using a lathe are subtractive manufacturing activities, i.e., they take a block of material and chip away anything that doesn’t belong in the final part design.

3D printers started out making cheap, short-life plastic parts but recently, using titanium oxide powders, have been used to create extremely long lived, metal aircraft parts and nowadays can create any short- or long-lived plastic part imaginable.  A few limitations persist, namely, the size of the printer determines the size of the part or product and 3D printers that can create multi-material parts are fairly limited.

Another problem is the economics of 3D printing of parts, both in time and cost.  Volume production, using subtractive manufacturing of parts is probably still a viable alternative, i.e., if you need to manufacture 1000 or more of the same part, it probably still makes sense to use standard manufacturing techniques.   However, the boundary as to where it makes economic sense to 3D print a part or whether to use a lathe to manufacture a part is gradually moving upward.  Moreover, as more multi-material capable 3D printers start coming online, the economics of volume product manufacturing (not just a single part) will cause a sea change in product construction.

Information based, intra-planetary commerce

The Economist article discussed some implications of sophisticated 3D printers becoming available in the near future.  Specifically, with 3D printers coming soon, manufacturing can be done locally rather than having to ship parts and products from one country to another.  Using 3D printers, all one needs to do is transmit the product design to wherever it needs to be produced and sold.  They believed this would eliminate most cost advantages available today for low-wage countries that manufacture parts and products.

The other implication that comes with newer 3D printers is that product customization is now much easier to do.  I envision clothing, furnishing, and other goods that can be literally tailor made for an individual with the proper use of design rule checking CAD software together with local, sophisticated 3D printers.  How Joe Consumer fires up a CAD program and tailors their product is another matter.  But with 3D printers coming online, sophisticated, CAD knowledgeable users could almost do this today.


In the end, the information needed to create a part or a product will be the key intellectual property.  It’s already been happening for years now but the dawn of 3D printers will accelerate this trend even more.

Also, 3D printers will expand information commerce, joining the already present information activities provided by the finance, research/science, media, and other information purveyors around the planet today.  Anything that makes information more a part of everyday commerce can be beneficial, whenever we ultimately begin to move off this world to the next planet – let alone when I want to move to Tahiti…


Tape vs. Disk, the saga continues

Inside a (Spectra Logic) T950 library by ChrisDag (cc) (from Flickr)

Was on a call late last month where Oracle introduced their latest generation T10000C tape system (media and drive) holding 5TB native (uncompressed) capacity. In the last 6 months I have been hearing about the coming of a 3TB SATA disk drive from Hitachi GST and others. And last month, EMC announced a new Data Domain Archiver, a disk only archive appliance (see my post on EMC Data Domain products enter the archive market).

Oracle assures me that tape density is keeping up if not gaining on disk density trends and capacity. But density or capacity are not the only issues causing data to move off of tape in today’s enterprise data centers.

“Dedupe Rulz”

A problem with the data density trends discussion is that it’s one dimensional (well literally it’s 2 dimensional). With data compression, disk or tape systems can easily double the density on a piece of media. But with data deduplication, the multiples start becoming more like 5X to 30X depending on frequency of full backups or duplicated data. And numbers like those dwarf any discussion of density ratios and as such, get everyone’s attention.
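To show where multiples like that come from, here is a minimal sketch of deduplication across repeated full backups. It assumes fixed-size 4KB chunks fingerprinted with SHA-256, a simplification I chose for illustration; commercial products typically use their own (often variable-size) chunking and are not described here. The dataset is random bytes standing in for real backup data.

```python
import hashlib
import os

CHUNK_SIZE = 4096  # assumed fixed chunk size, for illustration only


def dedupe_ratio(backups, chunk_size=CHUNK_SIZE):
    """Logical bytes backed up divided by bytes of unique chunks actually stored."""
    unique_chunks = {}  # chunk fingerprint -> chunk length
    logical_bytes = 0
    for data in backups:
        logical_bytes += len(data)
        for offset in range(0, len(data), chunk_size):
            chunk = data[offset:offset + chunk_size]
            unique_chunks.setdefault(hashlib.sha256(chunk).digest(), len(chunk))
    return logical_bytes / sum(unique_chunks.values())


# Ten "full backups" of the same 1MiB dataset, with one chunk changed per backup.
base = os.urandom(256 * CHUNK_SIZE)
backups = []
for day in range(10):
    changed = bytearray(base)
    changed[day * CHUNK_SIZE:(day + 1) * CHUNK_SIZE] = os.urandom(CHUNK_SIZE)
    backups.append(bytes(changed))

ratio = dedupe_ratio(backups)
print(f"Dedupe ratio over {len(backups)} full backups: {ratio:.1f}X")
```

The ratio tracks the number of retained full backups almost one for one when little changes between them, which is why frequency of fulls matters so much to the multiple.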

I can remember talking to an avowed tape engineer years ago, and he was describing deduplication technology at the VTL level as being architecturally impure and inefficient. From his perspective it needed to be done much earlier in the data flow. But what he failed to see was the ability of VTL deduplication to be plug-compatible with the tape systems of that time. Such ease of adoption allowed deduplication systems to build a beach-head and economies of scale. From there such systems have now been able to move up stream, into earlier stages of the backup data flow.

Nowadays, what with Avamar, Symantec PureDisk and others, source level deduplication, or close-to-source deduplication, is a reality. But all this came about because they were able to offer 30X the density on a piece of backup storage.

Tape’s next step

Tape could easily fight back. All that would be needed is some system in front of a tape library that provided deduplication capabilities not just to the disk media but the tape media as well. This way the 30X density over non-deduplicated storage could follow through all the way to the tape media.

In the past, this made little sense because a deduplicated tape would require potentially multiple volumes in order to restore a particular set of data. However, with today’s 5TB of data on a tape, maybe this doesn’t have to be the case anymore. In addition, by having a deduplication system in front of the tape library, it could support most of the immediate data restore activity while data restored from tape was sort of like pulling something out of an archive and as such, might take longer to perform. In any event, with LTO’s multi-partitioning and the other enterprise class tapes having multiple domains, creating a structure with meta-data partition and a data partition is easier than ever.

“Got Dedupe”

There are plenty of places where today’s tape vendors can obtain deduplication capabilities. Permabit offers dedupe code for OEM applications for those that have no dedupe systems today. FalconStor, Sepaton and others offer deduplication systems that can be OEMed. IBM, HP, and Quantum already have tape libraries and their own dedupe systems available today, all of which can readily support a deduplicating front-end to their tape libraries, if they don’t already.

Where “Tape Rulz”

There are places where data deduplication doesn’t work very well today, mainly rich media, physics, biopharm and other non-compressible big-data applications. For these situations, tape still has a home but for the rest of the data center world today, deduplication is taking over, if it hasn’t already. The sooner tape gets on the deduplication bandwagon the better for the IT industry.


Of course there are other problems hurting tape today. I know of at least one large conglomerate that has moved all backup off tape altogether, even data which doesn’t deduplicate well (see my previous Oracle RMAN posts). And at least another rich media conglomerate that is considering the very same move. For now, tape has a safe harbor in big science, but it won’t last long.