EMC Data Domain products enter the archive market

(c) 2011 Silverton Consulting, Inc., All Rights Reserved

In another assault on the tape market, EMC today announced a new Data Domain 860 Archiver appliance. The new system supports both short-term and long-term retention of backup data, attacking one of the last bastions of significant tape use – long-term data archives.

Historically, a cheap version of archiving has been the long-term retention of full backup tapes. If one needed to keep data around for 5 years, one would keep all the full backup tape sets offsite, in a vault somewhere, for 5 years, then rotate the tapes (bring them back into scratch use) after the 5 years elapsed. One problem with this: tape technology advances to a new generation roughly every 2-3 years, so a 5-year-old tape cartridge would be at least one generation back before it could be re-used. But current tape drives read media two generations back and write media at least one generation back, so this use was still feasible. I would say that many tape users did something like this to create a “pseudo-archive”.

On the other hand, there exist many specific archive point products focused on one or a few application arenas, such as email, records, or database archives, which extract specific data items and place them into an archive. These generally do not apply outside their application domains but are used to support stringent compliance requirements. The advantage of these application-based archive systems is that the data is actually removed from primary storage, taken out of any data protection activities, and placed permanently in “archive storage” only. Such data is subject to strict retention policies and, as such, is inviolate (cannot be modified) and cannot be deleted until formally expired.

Enter the Data Domain 860 Archiver. This system supports up to 24 disk shelves, each of which can be dedicated to either short- or long-term data retention. Backup file data is moved within the appliance, by automated policy, from short- to long-term storage. Up to four disk shelves can be dedicated to short-term storage, with the remainder treated as long-term archive units.

When a long-term archive unit (disk shelf) fills up with backup data it is “sealed”, i.e., it is given all the metadata required to reconstruct its file system and deduplication domain, and thus it does not require any other disk shelf to access its data. In this way one creates a standalone unit that contains everything needed to recover the data – not unlike a full backup tape set, which can be used in standalone fashion to restore data.
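
To make the idea concrete, here is a minimal sketch of a sealed, self-contained archive unit. The field names are mine and purely hypothetical – EMC has not published the Archiver’s actual on-disk layout – but the point is that the shelf carries its own namespace and dedupe index, so restores never need another shelf:

```python
# A minimal sketch of a "sealed" archive unit -- illustrative only, not
# EMC's actual format. Everything needed to rebuild the file system and
# deduplication domain travels with the shelf itself.
from dataclasses import dataclass

@dataclass
class SealedArchiveUnit:
    shelf_id: str
    blob: bytes          # stand-in for the shelf's raw data capacity
    namespace: dict      # file path -> ordered list of segment fingerprints
    segment_index: dict  # fingerprint -> (offset, length) within this shelf

    def restore(self, path: str) -> bytes:
        """Reassemble a file using only data resident on this shelf."""
        locations = (self.segment_index[fp] for fp in self.namespace[path])
        return b"".join(self.blob[off:off + ln] for off, ln in locations)

# Tiny usage example: two deduplicated segments rebuild one file.
unit = SealedArchiveUnit(
    shelf_id="shelf-07",
    blob=b"hello world",
    namespace={"/backups/file1": ["fp-a", "fp-b"]},
    segment_index={"fp-a": (0, 5), "fp-b": (5, 6)},
)
assert unit.restore("/backups/file1") == b"hello world"
```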

Today, the Data Domain 860 Archiver supports only file access and DD Boost data access. As such, the backup software is responsible for deleting data that has expired. Such data will then be absent from any subsequent backups, and as policy automation copies those backups to long-term archive units, it will be missing from there as well.

While Data Domain’s Archiver cannot remove data from ongoing backup streams the way application-based archive products can, it does look very much like what can be achieved with tape-based archives today.

One can also replicate base Data Domain or Archiver appliances to an Archiver unit to achieve offsite data archives.

—-

Full disclosure: I currently work with EMC on projects specific to other products but am not currently working on anything associated with this product.

Tape, your move…

The problems with digital audio archives

ldbell15 by Zyada (cc) (from Flickr)

A recent article in Rolling Stone (File Not Found: The Record Industry’s Digital Storage Crisis) laments the fact that digital recordings can become unusable due to format changes, plugin changes, and/or files no longer being readable (file not found).

In olden days, multi-track masters were recorded on audio tape and kept in vaults.  Audio tape formats never seemed to change, or at least changed infrequently, and thus tapes were re-usable years or decades after being recorded.  And the audio tape drives seemed to last forever.

Digital audio recordings, on the other hand, are typically stored in book cases/file cabinets/drawers, on media that can easily become out-of-date technology (i.e., unreadable) and in digital formats that seem to change with every new version of software.

Consumer grade media doesn’t archive very well

The article talks about using hard drives for digital recordings and trying to read them decades after they were recorded.  I would be surprised if such drives still spin up (due to stiction), let alone remain readable.  But even if these were CDs or DVDs, the lifetime of consumer grade media is not that long – maybe a couple of years at best if treated well, and considerably less if abused by writing on them or by bad handling.

Digital audio formats change frequently

The other problem with digital audio recordings is that formats go out of date.  I am no expert, but let’s take Apple’s Garage Band as an example.  I would be surprised if, 15 years down the line, a Garage Band session recorded today were readable/usable with Garage Band 2025, assuming that even exists.  That sounds like a long time, but it’s probably nothing for popular music coming out today.

Solutions to digital audio media problems

Audio recordings must use archive grade media if they are to survive for longer than 18-36 months.  I am aware of archive grade DVD disks but have never tested any, so I cannot speak to their viability in this application.  However, for an interesting discussion on archive quality CD/DVD media see How to choose CD/DVD archival media. But there are other alternatives.

Removable data center class archive media today include magnetic tape, removable magnetic disks, and removable MO disks.

  • Magnetic tape – LTO media vendors specify an archive life on the order of 30 years; however, this assumes a drive exists that can read the media.  The LTO consortium states that current generation drives will read back two generations (an LTO-5 drive today reads LTO-4 and LTO-3 media) and write back one generation (an LTO-5 drive can write on LTO-4 media [in LTO-4 format]).  With LTO generations coming every 2 years or so, it would take only 6 years for an LTO volume recorded today to become unreadable by current drives (see the sketch after this list).  Naturally, one could keep an old drive around, but maintenance/service would no longer be available for it after a couple of years.  LTO drives are available from a number of vendors.
  • Magnetic disk – The RDX Storage Alliance claims a media archive life of 30 years, but I wonder whether an RDX drive will still exist that can read it, and also how that archive life was validated.  Today’s removable disk typically imitates a magnetic tape drive/format.  The most prominent removable disk vendor is ProStor Systems, but there are others.
  • Magneto-optical (MO) media – Plasmon UDO claims a media life of 50+ years for its magneto-optical media.  UDO has been used for years to record check images, medical information and other data.  Nonetheless, UDO technology has recently not been able to keep up with other digital archive solutions and has gained a pretty bad rap for usability problems.  However, Plasmon plans to release a new generation of the UDO product line in 2010, which may shake things up if it arrives and can address the usability issues.
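
To make the tape readability window concrete, here is the arithmetic as a quick sketch, assuming the 2-year generation cadence and 2-generation read-back compatibility cited in the magnetic tape bullet above:

```python
# Back-of-the-envelope check on the LTO readability window cited above.
GEN_CADENCE_YRS = 2  # a new LTO generation roughly every 2 years (assumed)
READ_BACK_GENS = 2   # an LTO-N drive reads media back to LTO-(N-2)

# A cartridge written today falls out of the read-back window once current
# drives are READ_BACK_GENS + 1 generations ahead of it.
years_until_unreadable = (READ_BACK_GENS + 1) * GEN_CADENCE_YRS
print(years_until_unreadable)  # -> 6, matching the 6-year estimate above
```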

Finally, one could use non-removable, high density disk drives and migrate the audio data every 2-3 years to new generation disks.  This would keep the data readable and continuously accessible.  Modern storage systems with RAID and other advanced protection schemes can protect data from any single, and potentially double, drive failure, but as drives age their error rates go up.  This is why the data needs to be moved to new disks periodically.  Naturally, this migration is more frequent than with magnetic tape, but given disk drive usability and capacity gains, it might make sense in certain applications.

As for removable USB sticks – it is unclear what the archive life is for these consumer devices, but potentially some version aimed at the archive market might make sense.  It would need to be robust, have a long archive life, and be cheap enough to compete with all of the above.  I just don’t see anything here yet.

Solutions to digital audio format problems

There needs to be an XML-like description of a master recording that reduces everything to a more self-defined level, describing the hierarchy of the recording and providing object buckets for the various audio tracks/assets.  Plugins that create special effects would need to render their effects to something akin to an MPEG-like track that could be mixed with the other tracks, surrounded by metadata describing where it starts, where it ends, and other important info.
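
As a sketch of what that might look like – all element and attribute names below are mine and purely hypothetical, since no such standard exists today – built with Python’s standard XML library:

```python
# Illustrative sketch of a self-describing master recording hierarchy:
# session -> tracks -> assets, with plugin effects rendered down to plain
# audio objects plus metadata saying where they start and end.
import xml.etree.ElementTree as ET

session = ET.Element("master_recording", title="demo_session",
                     sample_rate="96000")

vocal = ET.SubElement(session, "track", name="lead_vocal")
ET.SubElement(vocal, "asset", format="pcm_24bit",
              uri="objects/vocal_take3.raw")

# A special effect reduced to an ordinary audio object, surrounded by
# metadata recording where it applies so it can be re-mixed later.
fx = ET.SubElement(vocal, "effect_render", plugin="hall_reverb",
                   start="00:01:12.000", end="00:03:40.500")
ET.SubElement(fx, "asset", format="pcm_24bit",
              uri="objects/vocal_reverb.raw")

print(ET.tostring(session, encoding="unicode"))
```

The win is that any future tool could parse the hierarchy and pull out raw audio objects even if the original recording software were long gone.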

Barring that, some form of standardization on a master recording format would work.  Such a standard could be supported by all major recording tools and would allow a master recording to be exported and imported across software tools/versions.  As the format evolved, migration/conversion products could be supplied to upgrade old formats to new ones.

Another approach is to have some repository for current master audio recording formats.  As software packages go out of date or out of business, their recording formats could be stored in some “format repository”, funded by the recording industry and maintained in perpetuity.  Plug-in use would need to be documented similarly.  With a repository like this around, and “some amount” of coding, no master recording need be lost to out-of-date software formats.

Nonetheless, if your audio archive needs to be migrated periodically, that would be a convenient time to upgrade the audio format as well.

—-

I have written about these problems before in a more general sense (see Today’s data and the 1000 year archive), but the recording industry seems to be “leading edge” for these issues. When producer T Bone Burnett testifies at a hearing that “Digital is a feeble storage medium”, it’s time to step up and take action.

Digital storage is no more feeble than analog storage – they each have their strengths and weaknesses.  Analog storage has gone away because it couldn’t keep up with digital recording densities, pricing, and increased functionality.  Just because data is recorded digitally doesn’t mean it has to be impermanent, hard to read 15-35 years hence, or in formats that are no longer supported.  But it does take some careful thought on what storage media you use and on how you format your data.

Comments?

5 killer apps for $0.10/TB/year

Biblioteca José Vasconcelos / Vasconcelos Library by * CliNKer * (from flickr) (cc)

Cloud storage keeps getting more viable and I see storage pricing going down considerably over time.  All of which got me thinking: what could be done with a dime per TB per year of storage ($0.10/TB/yr)?  Now most cloud providers charge 10 cents or more per GB per month, so this is at least 12,000 times less expensive, but it’s inevitable at some point in time.

So here are my 5 killer apps for $0.10/TB/yr cloud storage:

  1. Photo record of life – something akin to glasses which would record a wide angle, high mega-pixel video record of everything I looked at, for every second of my waking life.  A photo shot every second for 12hrs/day, 365days/yr, would be about ~16M photos, and at 4 MB per photo this would be about ~64TB per person per year.  For my 4 person family this would cost ~$26 for each year of family life, and over a 40 year family time span the last payment would be ~$1040, an average payment of ~$520/year (the sketch after this list checks this arithmetic).
  2. Audio recording of life – something akin to an always-on bluetooth headset which would record an audio feed to go with the semi-video or photo record above.  By being an always-on bluetooth headset it would automatically catch cell phone as well as spoken conversations, but it would need to plug into landlines as well.  As discussed in my YB by 2015 archive post, one minute of MP3 audio recording takes up roughly a MB of storage.  Let’s say I converse with someone ~33% of my waking day, about 4 hrs of MP3 audio/day, 365days/yr, or roughly 88GB per year per person.  For my family this would cost under $0.04 for each year of family life, and over a 40 year family life span my last payment would be ~$1.40, an average of ~$0.72/yr.
  3. Home security cameras – with ethernet based security cameras, it wouldn’t be hard to record 360 degree outside coverage as well as inside points of entry video.  The quantities for the photo record of my life would suffice here as well, but one doesn’t need to retain the data for a whole year – perhaps a rolling 30 day record would suffice, though it would be recorded 24 hours a day. Assuming 8 cameras outside and inside, this could be stored in about 10TB of storage per camera, or about 80TB of storage total, i.e., $8/year, and it would not increase over time.
  4. No more deletes/version everything – if storage were cheap enough we would never delete data.  Normal data change activity runs at a 5 to 10% per week rate, but this does not account for retaining deleted data.  So let’s say we would need to store an additional ~20% of primary/active data per week for deleted data.  For a 1TB primary storage working set, a ~20% deletion rate per week would be ~10TB of deleted data per year per person, for my family ~$4/yr, and my last yearly payment would be ~$160.  If we were to factor in data growth rates of ~20%/year, this would go up substantially, averaging ~$7.3k/yr over 40 years.
  5. Customized search engines – if storage AND bandwidth were cheap enough it would be nice to have my own customized search engine. Such a capability would follow all my web clicks, spawning a search spider for every website I traverse and providing customized “deep” searching for every web page I view.  Such an index might take 50% of the size of a page; my old website averaged ~18KB per page, so at 50% this index would require ~9KB per page. Assuming I look at ~250 web pages per business day, of which maybe ~170 are unique, and each unique page links to 2 more unique pages, which link to two more, which link to two more, …  If we go 10 pages deep, then for 170 pages viewed with an average branching factor of 2, we would need to index ~174K pages/day, which for a year would represent about 0.6TB of page index.  For my household, a customized search engine would cost ~$0.25 of additional storage per year, and over 40 years my last payment would be ~$10.
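
As a sanity check, here is the arithmetic behind items 1 and 5 in sketch form – all round numbers taken from the list above, nothing measured:

```python
# Sanity check on the storage arithmetic above (round numbers, not measured).
PRICE_PER_TB_YR = 0.10  # the hypothetical dime/TB/year rate
FAMILY = 4
YEARS = 40

# Item 1: one 4 MB photo per second, 12 hrs/day, 365 days/yr.
photos_per_yr = 12 * 3600 * 365               # ~15.8M shots per person
tb_per_person_yr = photos_per_yr * 4 / 1e6    # MB -> TB, ~63 TB
family_tb_per_yr = tb_per_person_yr * FAMILY  # ~252 TB added each year
# Storage accumulates, so the year-40 bill covers 40 years of photos.
last_payment = family_tb_per_yr * YEARS * PRICE_PER_TB_YR
avg_payment = family_tb_per_yr * (YEARS + 1) / 2 * PRICE_PER_TB_YR
print(f"photos: last ~${last_payment:.0f}, average ~${avg_payment:.0f}/yr")
# -> last ~$1009, average ~$517/yr (the ~$1040 / ~$520 above)

# Item 5: ~174K pages/day indexed at ~9 KB each, 365 days/yr, per person.
index_tb_per_yr = 174_000 * 9 * 365 / 1e9     # KB -> TB, ~0.57 TB/person/yr
family_cost = index_tb_per_yr * FAMILY * PRICE_PER_TB_YR
print(f"search index: ~${family_cost:.2f}/yr for the household")
# -> ~$0.23/yr, i.e. the ~$0.25 above
```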

I struggled to come up with ideas that would cost between $10 and $500 a year, as every other storage use came out significantly less than $1/year for a family of four.  This seems to say that there might be plenty of applications at under $10 per TB per year, still ~120X less expensive than current cloud storage costs.

Any other applications out there that could take  advantage of a dime/TB/year?

What is cloud storage good for?

Facebook friend carrousel by antjeverena (cc) (from flickr)

Cloud storage has emerged as a viable business service in the last couple of years, but what does cloud storage really do for the data center?  Moving data out to the cloud makes for unpredictable access times, with potentially unsecured and unprotected data.  So what does the data center gain by using cloud storage?

  • Speed – it often takes a long time (days, weeks, even months) to add storage to in-house data center infrastructure.  In this case, having a cloud storage provider where one can buy additional storage by the GB/month may make sense if one is developing/deploying new applications where speed to market is important.
  • Flexibility – data center storage is often leased or owned for long time periods.  If an application’s data storage requirements vary significantly over time, then cloud storage, purchasable or retirable on a moment’s notice, may be just right.
  • Distributed data access – some applications require data to be accessible around the world.  Most cloud providers have multiple data centers throughout the world that can be used to host one’s data. Such multi-site data centers can often be accessed much more quickly than going back to a central data center.
  • Data archive – backing up data that is infrequently accessed wastes time and resources. As such, this data could easily reside in the cloud with little trouble.  References to such data would need to be redirected to one’s cloud provider, but that’s about all that needs to be done.
  • Disaster recovery – disaster recovery for many data centers is very low on their priority list.  Cloud storage provides an easy, ready-made solution for accessing one’s data outside the data center.  If you elect to copy all mission critical data out to the cloud on a periodic basis, then this data could theoretically be accessed anywhere, usable in many DR scenarios.

There are probably some I am missing here, but these will do for now.  Most cloud storage providers can provide any and all of these services.

Of course, all these capabilities can be provided in-house with additional onsite infrastructure, multi-site data centers, archive systems, or offsite backups.  But the question then becomes which is more economical.  Cloud providers can amortize their multi-site data centers across many customers and, as such, may be able to provide these services much more cheaply than can be done in-house.

Now if they could only solve that unpredictable access time, …

7 grand challenges for the next storage century

Clock tower (4) by TJ Morris (cc) (from flickr)

I saw a recent IEEE Spectrum article on engineering’s grand challenges for the next century and thought something similar should be done for data storage. So this is a start:

  • Replace magnetic storage – most predictions show that magnetic disk storage has another 25 years and magnetic tape another decade after that before they run out of steam. Such end-dates have been wrong before, but it is unlikely that we will be using disk or tape 50 years from now. Some sort of solid state device seems the most probable next evolution of storage. I doubt this will be NAND, considering its write endurance and other long-term reliability issues, but if such issues could be resolved maybe it could replace magnetic storage.
  • 1000 year storage – paper can be printed today with non-acidic ink and retain its image for over 1000 years. Nothing in data storage today can claim much more than 100 year longevity. The world needs data storage that lasts much longer than 100 years.
  • Zero energy storage – today SSD/NAND and rotating magnetic media consume energy constantly in order to be accessible. Ultimately, the world needs some sort of storage that consumes energy only when read or written; such storage would provide “online access with offline power consumption”.
  • Convergent fabrics running divergent protocols – whether it’s ethernet, infiniband, FC, or something new, all fabrics should be able to handle any and all storage (and datacenter) protocols. The internet has become so ubiquitous because it handles just about any protocol we throw at it. We need the same, or something similar, for datacenter fabrics.
  • Securing data – securing books or paper is relatively straightforward today: just throw them in a vault/safety deposit box. Securing data seems simple yet is not widely practiced today. It doesn’t have to be that way. We need better, longer lasting tools and methodologies to secure our data.
  • Public data repositories – libraries exist to provide access to the output of society in the form of books, magazines, papers and other printed artifacts. No such repository exists today for data. Society would be better served if there were library-like institutions that could store and retrieve data. Most of the remaining issues are legal, due to data ownership, but technological issues exist here as well.
  • Associatively accessed storage – sequential and random access have been around for over half a century now. Associative storage could complement these as another approach, allowing storage to be retrieved by its content. We can kind of do this today by keywording and indexing data. Biological memory is accessed via associations or linkages to other concepts; once accessed, memories seem almost sequentially accessed from there. Something comparable to biological memory may be required to build more intelligent machines.

Some of these are already being pursued, while others receive no interest today. Nonetheless, I believe they all deserve investigation if storage is to continue to serve its primary role for society: a long-term storehouse for society’s culture, thoughts and deeds.

Comments?