NSA’s huge (YBs) new data center to turn on in 2013



Ran across a story in Wired about the new NSA Utah data center today which is scheduled to be operational in September of 2013.

This new data center is intended to house copies of all communications intercepted the NSA.  We have talked about this data center before and how it’s going to store YB of data (See my Yottabytes by 2015?! post).

One major problem with having a YB of communications intercepts is that you need to have multiple copies of it for protection in case of human or technical error.

Apparently, NSA has a secondary data center to backup its Utah facility in San Antonio.   That’s one copy.   We also wrote another post on protecting and indexing all this data (see my Protecting the Yottabyte Archive post)

NSA data centers

The Utah facility has enough fuel onsite to power and cool the data center for 3 days.  They have a special power station to supply the 65MW of power needed.   They have two side by side raised floor halls for servers, storage and switches, each with 25K square feet of floor space. That doesn’t include another 900K square feet of technical support and office space to secure and manage the data center.

In order to help collect and temporarily storage all this information, apparently the agency has been undergoing a data center building boom, renovating and expanding their data centers throughout the states.  The article discusses some of other NSA information collection points/data centers, in Texas, Colorado, Georgia, Hawaii, Tennessee, and of course,  Maryland.

New NSA super computers

In addition to the communication intercept storage, the article also talks about a special purpose, decrypting super computer that NSA has invented over the past decade which will also be housed in the Utah data center.  The NSA seems to have created a super powerful computer that dwarfs the current best Cray XT5 super computer clusters that operate at 1.75 petaflops available today.

I suppose what with all the encrypted traffic now being generated, NSA would need some way to decrypt this information in order to understand it.  I was under the impression that they were interested in the non-encrypted communications, but I guess NSA is even more interested in any encrypted traffic.

Decrypting old data

With all this data being stored, the thought is that the data now encrypted with unbreakable AES-128, -192 or -256 encryption will eventually become decypherable.  At that time, foriegn government and other secret communications will all be readable.

By storing this secret communications now, they can scan this treasure trove for patterns that eventually occur and once found, such patterns will ultimately lead to decrypting the data.  Now we know why they need YB of storage.

So NSA will at least know what was going on in the past.  However, how soon they can move that up to do real time decryption of communications today is another question.  But knowing the past, may help in understanding what’s going on today.


So be careful what you say today even if it’s encrypted.  Someone (NSA and its peers around the world) will probably be listening in and someday soon, will understand every word that’s been said.


The problems with digital audio archives

ldbell15 by Zyada (cc) (from Flickr)
ldbell15 by Zyada (cc) (from Flickr)

A recent article in Rolling Stone (File Not Found: The Record Industry’s Digital Storage Crisis) laments the fact that digital recordings can go out of service due to format changes, plugin changes, and/or files not being readable (file not found).

In olden days, multi-track masters were recorded on audio tape and kept in vaults.  Audio tape formats never seemed to change or at least changed infrequently, and thus, re-usable years or decades after being recorded.  And the audio tape drives seemed to last forever.

Digital audio recordings on the other hand, are typically stored in book cases/file cabinets/drawers, on media that can easily become out-of-date technology (i.e., un-readable) and in digital formats that seem to change with every new version of software.

Consumer grade media doesn’t archive very well

The article talks about using hard drives for digital recordings and trying to read them decades after they were recorded.  I would be surprised if they still spin up (due to stiction) let alone still readable.  But even if these were CDs or DVDs, the lifetime of consumer grade media is not that long, maybe a couple of years at best, if treated well and if abused by writing on them or by bad handling, it’s considerably less than that.

Digital audio formats change frequently

The other problem with digital audio recordings is that formats go out of date.  I am no expert but let’s take Apple’s Garage Band as an example.  I would be surprised if 15 years down the line that a 2010 Garage Band session recorded today was readable/usable with Garage Band 2025, assuming it even existed.  Sounds like a long time but it’s probably nothing for popular music coming out today.

Solutions to digital audio media problems

Audio recordings must use archive grade media if it’s to survive for longer than 18-36 months.  I am aware of archive grade DVD disks but have never tested any, so cannot speak to their viability in this application.  However, for an interesting discussion on archive quality CD&DVD media see How to choose CD/DVD archival media. But, there are other alternatives.

Removable data center class archive media today includes magnetic tape, removable magnetic disks or removable MO disks.

  • Magnetic tape – LTO media vendors specify archive life on the order of 30 years, however this assumes a drive exists that can read the media.  The LTO consortium states that current generation drives will read back two generations (LTO-5 drive today reads LTO-4 and LTO-3 media) and write back one generation (LTO-5 drive can write on LTO-4 media [in LTO-4 format]).  With LTO generations coming every 2 years or so, it would only take 6 years for a LTO volume, recorded today to be unreadable by current drives.  Naturally, one could keep an old drive around but maintenance/service would no longer be available for it after a couple of years.  LTO drives are available from a number of vendors.
  • Magnetic disk – The RDX Storage Alliance claims a media archive life of 30 years but I wonder whether a RDX drive would exist that could read it and the other question is how archive life was validated. Today’s removable disk typically imitates a magnetic tape drive/format.  The most prominent removable disk vendor is ProStor Systems but there are others.
  • Magneto-optical (MO) media – Plasmon UDO claims a media life of 50+ years for their magneto-optical media.  UDO has been used for years to record check images, medical information and other data.  Nonetheless,  recently UDO technology has not been able to keep up with other digital archive solutions and have gained a pretty bad rap for usability problems.  However, they plan to release a new generation of UDO product line in 2010 which may shake things up if it arrives and can address their usability issues.

Finally, one could use non-removable, high density disk drives and migrate the audio data every 2-3 years to new generation disks.  This would keep the data readable and continuously accessible.  Modern storage systems with RAID and other advanced protection schemes can protect data from any single and potentially double drive failure but as drives age, their error rate goes up.  This is why the data needs to be moved to new disks periodically.  Naturally, this is more frequently than magnetic tape, but given disk drive usability and capacity gains, might make sense in certain applications.

As for removable USB sticks – unclear what the archive life is for these consumer devices but potentially some version that went after the archive market might make sense.  It would need to be robust, have a long archive life and be cheap enough to compete with all the above.  I just don’t see anything here yet.

Solutions to digital audio format problems

There needs to be an XML-like description of a master recording that reduces everything to a more self-defined level which describes the hierarchy of the recording, and provides object buckets for various audio tracks/assets.  Plugins that create special effects would need to convert their effects to something akin to a MPEG-like track that could be mixed with the other tracks, surrounded by meta-data describing where it starts, ends and other important info.

Baring that, some form of standardization on a master recording format would work.  Such a standard could be supported by all major recording tools and would allow a master recording to be exported and imported across software tools/versions.  As this format evolved, migration/conversion products could be supplied to upgrade old formats to new ones.

Another approach is to have some repository for current master audio recording formats.  As software packages go out of date/business, their recording format could be stored in some “format repository”, funded by the recording industry and maintained in perpetuity.  Plug-in use would need to be documented similarly.  With a repository like this around and “some amount” of coding, no master recording need be lost to out-of-date software formats.

Nonetheless, If your audio archive needs to be migrated periodically, it be a convenient time to upgrade the audio format as well.


I have written about these problems before in a more general sense (see Today’s data and the 1000 year archive) but the recording industry seems to be “leading edge” for these issues. When Producer T Bone Burnett testifies at a hearing that “Digital is a feeble storage medium” it’s time to step up and take action.

Digital storage is no more feeble than analog storage – they each have their strengths and weaknesses.  Analog storage has gone away because it couldn’t keep up with digital recording densities, pricing, and increased functionality.  Just because data is recorded digitally doesn’t mean it has to be impermanent, hard to read 15-35 years hence, or in formats that are no longer supported.  But it does take some careful thought on what storage media you use and on how you format your data.


What is cloud storage good for?

Facebook friend carrousel by antjeverena (cc) (from flickr)
Facebook friend carrousel by antjeverena (cc) (from flickr)

Cloud storage has emerged  as a viable business service in the last couple of years, but what does cloud storage really do for the data center.  Moving data out to the cloud makes for unpredictable access times with potentially unsecured and unprotected data.  So what does the data center gain by using cloud storage?

  • Speed – it  often takes a long time (day-weeks-months) to add storage to in-house data center infrastructure.  In this case, having a cloud storage provider where one can buy additional storage by the GB/Month may make sense if one is developing/deploying new applications where speed to market is important.
  • Flexibility – data center storage is often leased or owned for long time periods.  If an application’s data storage requirements vary significantly over time then cloud storage, purchase-able or retire-able on a moments notice, may be just right.
  • Distributed data access – some applications require data to be accessible around the world.  Most cloud providers have multiple data centers throughout the world that can be used to host one’s data. Such multi-site data centers can be often be accessed much quicker than going back to a central data center.
  • Data archive – backing up data that is infrequently accessed wastes time and resources. As such, this data could easily reside in the cloud with little trouble.  References to such data would need to be redirected to one’s cloud provider but that’s about all that needs to be done.
  • Disaster recovery – disaster recovery for many data centers is very low on their priority list.  Cloud storage provides an easy, ready made solution to accessing one’s data outside the data center.  If you elect to copy all mission critical data out to the cloud on a periodic basis, then this data could theoretically be accessed anywhere, usable in many DR scenarios.

Probably some I am missing here but these will do for now.  Most cloud storage providers can provide any and all of these services.

Of course all these capabilities can be done in-house with additional onsite infrastructure, multi-site data centers, archive systems, or offsite backups.  But the question then becomes which is more economical.  Cloud providers can amortize their multi-site data centers across many customers and as such, may be able to provide these services much cheaper than could be done in-house.

Now if they could only solve that unpredictable access time, …

Protecting the Yottabyte archive

blinkenlights by habi (cc) (from flickr)
blinkenlights by habi (cc) (from flickr)

In a previous post I discussed what it would take to store 1YB of data in 2015 for the National Security Agency (NSA). Due to length, that post did not discuss many other aspects of the 1YB archive such as ingest, index, data protection, etc. Thus, I will attempt to cover each of these in turn and as such, this post will cover some of the data protection aspects of the 1YB archive and its catalog/index.

RAID protecting 1YB of data

Protecting the 1YB archive will require some sort of parity protection. RAID data protection could certainly be used and may need to be extended to removable media (RAID for tape), but that would require somewhere in the neighborhood of 10-20% additional storage (RAID5 across 10 to 5 tape drives). It’s possible with Reed-Solomon encoding and using RAID6 that we could take this down to 5-10% of additional storage (RAID 6 for a 40 to a 20 wide tape drive stripe). Possibly other forms of ECC (such as turbo codes) might be usable in a RAID like configuration which would give even better reliability with less additional storage.

But RAID like protection also applies to the data catalog and indexes required to access the 1YB archive of data. Ditto for the online data itself while it’s being ingested, indexed, or readback. For the remainder of this post I ignore the RAID overhead but suffice it to say with today’s an additional 10% storage for parity will not change this discussion much.

Also in the original post I envisioned a multi-tier storage hierarchy but the lowest tier always held a copy of any files residing in the upper tiers. This would provide some RAID1 like redundancy for any online data. This might be pretty usefull, i.e., if a file is of high interest, it could have been accessed recently and therefore resides in upper storage tiers. As such, multiple copies of interesting files could exist.

Catalog and indexes backups for 1YB archive

IMHO, RAID or other parity protection is different than data backup. Data backup is generally used as a last line of defense for hardware failure, software failure or user error (deleting the wrong data). It’s certainly possible that the lowest tier data is stored on some sort of WORM (write once read many times) media meaning it cannot be overwritten, eliminating one class of user error.

But this presumes the catalog is available and the media is locatable. Which means the catalog has to be preserved/protected from user error, HW and SW failures. I wrote about whether cloud storage needs backup in a prior post and feel strongly that the 1YB archive would also require backups as well.

In general, backup today is done by copying the data to some other storage and keeping that storage offsite from the original data center. At this amount of data, most likely the 2.1×10**21 of catalog (see original post) and index data would be copied to some form of removable media. The catalog is most important as the other two indexes could potentially be rebuilt from the catalog and original data. Assuming we are unwilling to reindex the data, with LTO-6 tape cartridges, the catalog and index backups would take 1.3×10**9 LTO-6 cartridges (at 1.6×10**12 bytes/cartridge).

To back up this amount of data once per month would take a gaggle of tape drives. There are ~2.6×10**6 seconds/month and each LTO-6 drive can transfer 5.4×10**8 bytes/sec or 1.4X10**15 bytes/drive-month but we need to backup 2.1×10**21 bytes of data so we need ~1.5×10**6 tape transports. Now tapes do not operate 100% of the time because when a cartridge becomes full it has to be changed out with an empty one, but this amounts to a rounding error at these numbers.

To figure out the tape robotics needed to service 1.5×10**6 transports we could use the latest T-finity tape library just announced by Spectra Logic . The T-Finity supports 500 tape drives and 122,000 tape cartridges, so we would need 3.0×10**3 libraries to handle the drive workload and about 1.1×10**4 libraries to store the cartridge set required, so 11,000 T-finity libraries would suffice. Presumably, using LTO-7 these numbers could be cut in half ~5,500 libraries, ~7.5×10**5 transports, and 6.6×10**8 cartridges.

Other removable media exist, most notably the Prostor RDX. However RDX roadmap info out to the next generation are not readily available and high-end robotics are do not currently support RDX. So for the moment tape seems the only viable removable backup for the catalog and index for the 1YB archive.

Mirroring the data

Another approach to protecting the data is to mirror the catalog and index data. This involves taking the data and copying it to another online storage repository. This doubles the storage required (to 4.2×10**21 bytes of storage). Replication doesn’t easily protect from user error but is an option worthy of consideration.

Networking infrastructure needed

Whether mirroring or backing up to tape, moving this amount of data will require substantial networking infrastructure. If we assume that in 2105 we have 32GFC (32 gb/sec fibre channel interfaces). Each interface could potentially transfer 3.2GB/s or 3.2×10**9 bytes/sec. Mirroring or backing up 2.1×10**21 bytes over one month will take ~2.5×10**6 32GFC interfaces. Probably should have twice this amount of networking just to not have any one be a bottleneck so 5×10**6 32GFC interfaces should work.

As for switches, the current Brocade DCX supports 768 8GFC ports and presumably similar port counts will be available in 2015 to support 32GFC. In addition if we assume at least 2 ports per link, we will need ~6,500 fully populated DCX switches. This doesn’t account for multi-layer switches and other sophisticated switch topologies but could be accommodated with another factor of 2 or ~13,000 switches.

Hot backups require journals

This all assumes we can do catalog and index backups once per month and take the whole month to do them. Now storage today normally has to be taken offline (via snapshot or some other mechanism) to be backed up in a consistent state. While it’s not impossible to backup data that is concurrently being updated it is more difficult. In this case, one needs to maintain a journal file of the updates going on while the data is being backed up and be able to apply the journaled changes to the data backed up.

For the moment I am not going to determine the storage requirements for the journal file required to cover the catalog transactions for a month, but this is dependent on the change rate of the catalog data. So it will necessarily be a function of the index or ingest rate of the 1YB archive to be covered in a future post.

Stay tuned, I am just having too much fun to stop.

XAM and data archives

Vista de la Biblioteca Vasconcelos by Eneas
Vista de la Biblioteca Vasconcelos by Eneas

XAM, a SNIA defined interface standard supporting reference data archives, is starting to become real. EMC and other vendors are starting to supply XAM compliant interfaces.  I could not locate (my Twitter survey for application vendors came back empty) any application vendors supporting XAM APIs but its only a matter of time .  What does XAM mean for your data archive?

The problem

Most IT shops with data archives use special purpose applications that support a vendor defined proprietary interface to store and retrieve data out of a dedicated archive appliance. For example, many email archives support EMC Centerra which has defined a proprietary Centerra API to store and retrieve data from their appliance.  Most other archive storage vendors have followed suit.  Leading to a proprietary vendor lock-in which slows adoption.

However, some proprietary APIs have been front-ended with something like NFS. The problem with NFS and other standard file interfaces is that they were never meant for reference data (data that does not change). So when you try to update an archived file one often gets some sort of weird system error.

Enter XAM

It was designed from the start for reference data. Moreover, XAM supports concurrent access to multiple vendor archive storage systems from the same application. As such, an application supplier need only code to one standard API to gain access to multiple vendor archive systems.

SNIA released the V1.0 XAM interface specfication last July  which defines XAM architecture, C- and JAVA-language API for both the application and the storage vendor.   Although from the looks of it the C version of vendor API is more complete.

However, currently I can only locate two archive storage vendors having released support for the XAM interface (EMC Centerra and SAND/DNA?).   A number of vendors have expressed interest in providing XAM interfaces (HP, HDS HCAP, Bycast StorageGrid and others).   How soon their XAM API support will be provided is TBD.

I would guess what’s really needed is for more vendors to start supporting XAM interface which would get the application vendors more interested in supporting XAM.   Its sort of a chicken and egg thing but I believe the storage vendors have the first move, the application vendors will take more time to see the need.

Does anyone know what other storage vendors support XAM today. Is there any single place where one could even find out? Ditto for applications supporting XAM today?