Archive – Silverton Consulting

EMC Data Domain products enter the archive market

Posted on January 18, 2011January 24, 2011 by Ray in Data, data protection, Data retention, Storage Backup, Storage longevity

(c) 2011 Silverton Consulting, Inc., All Rights Reserved

In another assault on the tape market, EMC announced today a new Data Domain ~~860~~ Archiver ~~appliance~~. This new system supports both short-term and long-term retention of backup data. This attacks one of the last bastions of significant tape use – long-term data archives.

Historically, a cheap version of archives had been the long-term retention of full backup tapes. As such, if one needed to keep data around for 5 years, one would keep all their full backup tape sets offsite, in a vault somewhere for 5 years. They could then rotate the tapes (bring them back into scratch use) after the 5 years elapsed. One problem with this – tape technology is advancing to a new generation of technology more like every 2-3 years and as such, a 5-year old tape cartridge would be at least one generation back before it could be re-used. But current tape technology always reads 2 generations and writes at least one generation back so this use would still be feasible. I would say that many tape users did something like this to create a “~~psuedo~~pseudo-archive”.

On the other hand, there exists many specific archive point products that focused on one or a few application arenas such as email, records, or database archives which would extract specific data items and place them into archive. These did not generally apply outside one or a few application domains but were used to support stringent compliance requirements. The advantage of these application based archive systems is that the data was actually removed from primary storage, out of any data protection activities and placed permanently in only “archive storage”. Such data would be subject to strict retention policies and as such, would be inviolate (couldn’t be modified) and could not be deleted until formally expired.

Enter the Data Domain 860 Archiver, this system supports up to 24 disk shelves, each one of which could either be dedicated to short- or long-term data retention. Backup file data is moved within the appliance by automated policy from short- to long-term storage. Up to 4-disk shelves can be dedicated to short-term storage with the remainder considered long-term archive units.

When a long-term archive unit (disk shelf) fills up with backup data it is “sealed”, i.e., it is given all the metadata required to reconstruct its file system and deduplication domain and thus, would not require the use of other disk shelves to access its data. In this way one creates a standalone unit that contains everything needed to recover the data. Not unlike a full backup tape set which can be used in a standalone fashion to restore data.

Today, the Data Domain 860 Archiver only supports file access and DD boost data access. By doing so, the backup software is responsible for deleting data that has expired. Such data will then be ~~absent~~ deleted from any backups taken and as policy automation copies the backups to long-term archive units it will be ~~missing~~ gone from there as well.

While Data Domain’s Archiver lacks removing the data from ongoing backup streams that application based archive products can achieve, it does look exactly like what could be achieved from tape based archives today.

One can also replicate base Data Domain or Archiver appliances to an Archiver unit to achieve offsite data archives.

—-

Full disclosure: I currently work with EMC on projects specific to other products but am not currently working on anything associated with this product.

Tape, your move…

Backup is for (E)discovery too

Posted on December 18, 2009 by Ray in Data discovery, Data search, Systems

Electronic Discovery Reference Model (from EDRM.net)

There has been lot’s of talk in twitterverse and elsewhere on how “backup is used for restore and archive is for e-discovery”, but I beg to differ.

If one were to take the time to review the EDRM (Electronic Discovery Reference Model) and analyze what happens during actual e-discovery processes, one would see that nothing is outside the domain of court discovery requests. Backups have and always will hold discoverable data just as online and user desktop/laptop storage do. In contrast, archives are not necessarily a primary source of discoverable data.

In my view, any data not in archive, by definition is online or on user desktop/laptop storage. Once online, data is most likely being backed up periodically and will show up in backups long before it’s moved to archive. Data deletions and other modifications can often be reconstructed from backups much better than from archive (with the possible exception of records management systems). Also, reconstructing data proliferation, such as who had a copy of what data when, is often crucial to court proceedings and normally, can only be reconstructed from backups.

Archives have a number of purposes but primarily it’s to move data that doesn’t change off company storage and out of its backup stream. Another popular reason for archive is to be used to satisfy compliance regimens that require companies to hold data for periods of time, such as mandated by SEC, HIPPA, SOX, and others. For example, SEC brokerage records must be held long after an account goes inactive, HIPPA health records must be held long after a hospital visit, SOX requires corporate records to be held long after corporate transactions transpire. Such records are more for compliance and/or customer back-history request purposes than e-discovery but here again any data stored by the corporation is discoverable.

So I believe it’s wrong to say that Backup is only for restore and archive is only for discovery. Information, anywhere within a company is discoverable. However, I would venture to say that a majority of e-discovery data comes from backups rather than elsewhere.

Now, as for using backups for restore,…

Protecting the Yottabyte archive

Posted on November 11, 2009January 27, 2010 by Ray in Data index, Networking, Storage, Storage Backup, Storage density, Storage performance, Systems

blinkenlights by habi (cc) (from flickr)

In a previous post I discussed what it would take to store 1YB of data in 2015 for the National Security Agency (NSA). Due to length, that post did not discuss many other aspects of the 1YB archive such as ingest, index, data protection, etc. Thus, I will attempt to cover each of these in turn and as such, this post will cover some of the data protection aspects of the 1YB archive and its catalog/index.

RAID protecting 1YB of data

Protecting the 1YB archive will require some sort of parity protection. RAID data protection could certainly be used and may need to be extended to removable media (RAID for tape), but that would require somewhere in the neighborhood of 10-20% additional storage (RAID5 across 10 to 5 tape drives). It’s possible with Reed-Solomon encoding and using RAID6 that we could take this down to 5-10% of additional storage (RAID 6 for a 40 to a 20 wide tape drive stripe). Possibly other forms of ECC (such as turbo codes) might be usable in a RAID like configuration which would give even better reliability with less additional storage.

But RAID like protection also applies to the data catalog and indexes required to access the 1YB archive of data. Ditto for the online data itself while it’s being ingested, indexed, or readback. For the remainder of this post I ignore the RAID overhead but suffice it to say with today’s an additional 10% storage for parity will not change this discussion much.

Also in the original post I envisioned a multi-tier storage hierarchy but the lowest tier always held a copy of any files residing in the upper tiers. This would provide some RAID1 like redundancy for any online data. This might be pretty usefull, i.e., if a file is of high interest, it could have been accessed recently and therefore resides in upper storage tiers. As such, multiple copies of interesting files could exist.

Catalog and indexes backups for 1YB archive

IMHO, RAID or other parity protection is different than data backup. Data backup is generally used as a last line of defense for hardware failure, software failure or user error (deleting the wrong data). It’s certainly possible that the lowest tier data is stored on some sort of WORM (write once read many times) media meaning it cannot be overwritten, eliminating one class of user error.

But this presumes the catalog is available and the media is locatable. Which means the catalog has to be preserved/protected from user error, HW and SW failures. I wrote about whether cloud storage needs backup in a prior post and feel strongly that the 1YB archive would also require backups as well.

In general, backup today is done by copying the data to some other storage and keeping that storage offsite from the original data center. At this amount of data, most likely the 2.1×10**21 of catalog (see original post) and index data would be copied to some form of removable media. The catalog is most important as the other two indexes could potentially be rebuilt from the catalog and original data. Assuming we are unwilling to reindex the data, with LTO-6 tape cartridges, the catalog and index backups would take 1.3×10**9 LTO-6 cartridges (at 1.6×10**12 bytes/cartridge).

To back up this amount of data once per month would take a gaggle of tape drives. There are ~2.6×10**6 seconds/month and each LTO-6 drive can transfer 5.4×10**8 bytes/sec or 1.4X10**15 bytes/drive-month but we need to backup 2.1×10**21 bytes of data so we need ~1.5×10**6 tape transports. Now tapes do not operate 100% of the time because when a cartridge becomes full it has to be changed out with an empty one, but this amounts to a rounding error at these numbers.

To figure out the tape robotics needed to service 1.5×10**6 transports we could use the latest T-finity tape library just announced by Spectra Logic . The T-Finity supports 500 tape drives and 122,000 tape cartridges, so we would need 3.0×10**3 libraries to handle the drive workload and about 1.1×10**4 libraries to store the cartridge set required, so 11,000 T-finity libraries would suffice. Presumably, using LTO-7 these numbers could be cut in half ~5,500 libraries, ~7.5×10**5 transports, and 6.6×10**8 cartridges.

Other removable media exist, most notably the Prostor RDX. However RDX roadmap info out to the next generation are not readily available and high-end robotics are do not currently support RDX. So for the moment tape seems the only viable removable backup for the catalog and index for the 1YB archive.

Mirroring the data

Another approach to protecting the data is to mirror the catalog and index data. This involves taking the data and copying it to another online storage repository. This doubles the storage required (to 4.2×10**21 bytes of storage). Replication doesn’t easily protect from user error but is an option worthy of consideration.

Networking infrastructure needed

Whether mirroring or backing up to tape, moving this amount of data will require substantial networking infrastructure. If we assume that in 2105 we have 32GFC (32 gb/sec fibre channel interfaces). Each interface could potentially transfer 3.2GB/s or 3.2×10**9 bytes/sec. Mirroring or backing up 2.1×10**21 bytes over one month will take ~2.5×10**6 32GFC interfaces. Probably should have twice this amount of networking just to not have any one be a bottleneck so 5×10**6 32GFC interfaces should work.

As for switches, the current Brocade DCX supports 768 8GFC ports and presumably similar port counts will be available in 2015 to support 32GFC. In addition if we assume at least 2 ports per link, we will need ~6,500 fully populated DCX switches. This doesn’t account for multi-layer switches and other sophisticated switch topologies but could be accommodated with another factor of 2 or ~13,000 switches.

Hot backups require journals

This all assumes we can do catalog and index backups once per month and take the whole month to do them. Now storage today normally has to be taken offline (via snapshot or some other mechanism) to be backed up in a consistent state. While it’s not impossible to backup data that is concurrently being updated it is more difficult. In this case, one needs to maintain a journal file of the updates going on while the data is being backed up and be able to apply the journaled changes to the data backed up.

For the moment I am not going to determine the storage requirements for the journal file required to cover the catalog transactions for a month, but this is dependent on the change rate of the catalog data. So it will necessarily be a function of the index or ingest rate of the 1YB archive to be covered in a future post.

Stay tuned, I am just having too much fun to stop.