Well, maybe an exabyte a day was way too small for 2009. The NSA is now reporting that it may be storing yottabytes (YB, 10**24 bytes) of data by 2015 somewhere in Utah. Later reports have the NSA reducing this to something closer to 1000 PB or so, but a YB of storage got me thinking.
This points out a couple of issues:
- How is the NSA going to store all this data?
- How will the NSA be able to retrieve anything from this amount of data?
- The storage industry must come up with a new term for 10**27 bytes of storage.
As a first stab at this, I would suggest NONABYTE (nona- is Latin for nine; (y)otta- is Italian for eight). In a similar way, perhaps we could use DECEMABYTE for 10**30 and UNDECEMABYTE for 10**33. That should last us for a couple of years.
Storing a yottabyte of data is no small matter. 10 to 100 petabytes (PB, 10**15 bytes) of data can be handled today by a number of storage systems, both cloud and non-cloud. Many cloud providers claim PB of storage under their environments, so this scale is entirely feasible today.
Exabytes (EB, 10**18 bytes) would seem to require an offline archive of data. Of course, somebody could conceivably build such an online storage complex (see below for how). Testing such a system might only be possible during implementation, but that would not be unusual for such leading-edge projects.
Zettabytes (ZB, 10**21 bytes) seem outside the realm of possibility today, being a million PB of storage. But offline archives could conceivably be built even for this amount of storage. It's conceivable that online storage of an EB of data could be used to support offline storage of a ZB of data.
1 YB of data in perspective
Yottabytes of data seem extremely large. If a minute of standard-definition digital video takes ~1 GB of storage, a yottabyte would hold about 10**15 minutes of video.
A minute of MP3 audio (as in a phone conversation) takes roughly a MB of storage, so 1 YB would be about 10**18 minutes of conversation. Realize there are only ~6×10**9 people on the planet, so this is enough storage for ~100 million (10**8) minutes of conversation from everyone on the planet. Seems like a lot, but who am I to judge.
Also realize there are only ~5×10**5 minutes in a year, so 10**24 bytes would be enough storage to record everything everybody said for ~333 years (10**6 bytes/minute × 6×10**9 people on earth × 5×10**5 minutes/year = 3×10**21 bytes to record everyone talking for one full year). Also, people sleep, don't often talk 100% of their waking time, and most conversations are between two people, so this is very conservative.
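The back-of-envelope arithmetic above can be sanity-checked in a few lines of Python. All the figures here are the rough, order-of-magnitude estimates from the text, not measured values:

```python
# Sanity check on the "1 YB in perspective" estimates above.
YB = 10**24                      # 1 yottabyte, in bytes
VIDEO_BYTES_PER_MIN = 10**9      # ~1 GB per minute of SD video (rough)
AUDIO_BYTES_PER_MIN = 10**6      # ~1 MB per minute of MP3 audio (rough)
PEOPLE = 6 * 10**9               # ~world population circa 2009
MIN_PER_YEAR = 5 * 10**5         # ~525,600 minutes in a year, rounded

video_minutes = YB // VIDEO_BYTES_PER_MIN      # minutes of SD video in 1 YB
audio_minutes = YB // AUDIO_BYTES_PER_MIN      # minutes of MP3 audio in 1 YB
minutes_per_person = audio_minutes // PEOPLE   # audio minutes per person

# Bytes to record everyone on earth talking nonstop for one year:
bytes_per_year_everyone = AUDIO_BYTES_PER_MIN * PEOPLE * MIN_PER_YEAR
years_of_everyone_talking = YB / bytes_per_year_everyone

print(f"{video_minutes:.0e} video minutes")
print(f"{minutes_per_person:.1e} audio minutes per person")
print(f"~{years_of_everyone_talking:.0f} years of everyone talking")
```

Running this reproduces the 10**15 video minutes, roughly 10**8 audio minutes per person, and the ~333-year figure.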
1 YB of data at rest
How to construct such a 1 YB archive poses many challenges. One would have to consider a multi-tier/level storage hierarchy made up of both removable and online storage.
Tape or other removable media would be an obvious choice for at least the lowest tier of storage, but keeping track of ~1.5×10**11 tape volumes (LTO-7 will maybe support 6.4 TB, or 6.4×10**12 bytes, per cartridge) seems outside today's capabilities.
Similar quantities of disk drives would be required to store 1 YB of data, but nobody would consider storing all of this online. Consider that only 5.4×10**8 disk drives were shipped in 2008, and it becomes obvious that large portions of the 1 YB archive must be offline. Deduplication would help, but audio and video don't dedupe well.
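The media counts work out as follows; the ~2 TB disk drive capacity is my own assumption for illustration, as is the 6.4 TB LTO-7 cartridge projection:

```python
# Rough media counts for storing 1 YB on tape or disk.
YB = 10**24
LTO7_BYTES = int(6.4e12)         # assumed 6.4 TB per LTO-7 cartridge
DISK_BYTES = 2 * 10**12          # assumed ~2 TB per disk drive

tape_cartridges = YB // LTO7_BYTES   # cartridges needed for 1 YB
disk_drives = YB // DISK_BYTES       # drives needed for 1 YB
drives_shipped_2008 = 5.4e8          # worldwide drive shipments, 2008

print(f"{tape_cartridges:.2e} LTO-7 cartridges")
print(f"{disk_drives:.0e} disk drives vs {drives_shipped_2008:.1e} shipped in 2008")
```

At ~1.6×10**11 cartridges or 5×10**11 drives, either way the media count dwarfs annual worldwide drive shipments by roughly three orders of magnitude.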
But that's nothing; try keeping track of the 10**18 to 10**20 files (assuming average file sizes from 10**6 bytes for audio down to 10**4 bytes for text).
I think this calls for an object store of some type. 10**6 objects are feasible today; scaling up to 10**18 through 10**20 would be a significant leap, but not outside technology available 5 (or maybe 10) years hence.
Next, one must consider the catalog for such a storage complex. Let's assume these are conversations and use the 10**18 number. Keeping just 100 bytes of metadata per file, the catalog would take 10**20 bytes of storage. Of course, 100 bytes seems pretty limiting to record all the important data about a conversation or even a text file, so 1000 bytes seems more realistic. Thus, we would need 10**21 bytes of storage just for the catalog. It seems even portions of the catalog would need to be offline to be realistically stored. This would not be optimal, but would accommodate a rudimentary listing of the 10**18-element catalog as a last resort.
Searching 1 YB of data
The NSA would probably want at least to search the catalog for items of interest, like a person's name, a phone number, or maybe even the time of a call. Indexes take anywhere from 20% to 100% of the data being searched. Let's say, with great people working on the project, they can get the catalog index down to 10% of the storage being searched. So there is yet another 10**20 bytes of data to make the catalog searchable. Now we would want the majority of this to be online and directly accessible, but even this is 100,000 PB of data, way beyond today's capabilities for online accessible storage.
Of course, it's possible that the agency might want to search the contents of the conversations for items of interest, such as words used. Any content index would take vastly more storage than a simple catalog index, but maybe this could be shrunk down to only 100% of the catalog size, or 10**21 bytes of storage. Again, 1,000,000 PB of data is unlikely to be kept online in total.
I am beginning to see how the NSA and MITRE may have come up with the YB figure: 10**20 for a catalog index, 10**21 for a catalog, another 10**21 for a vocabulary index, and ~10**24 bytes for the 10**18 conversations themselves at ~10**6 bytes apiece. Now a YB of storage is starting to make sense. If you took the 10**18 conversations down to, say, 10**15, with a catalog of 10**18 bytes and indexes of 10**19 bytes, this might be even more realistic. But even 10**15 conversations seems a bit much for 2015.
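Putting the pieces together shows where the YB really comes from. Under the assumptions above (1000 bytes of metadata per conversation, a catalog index at 10% of the catalog, a content index at 100% of it), the raw conversation data dominates everything else:

```python
# Totaling the estimates: raw conversation data dominates the YB figure.
CONVERSATIONS = 10**18
BYTES_PER_CONV = 10**6           # ~1 MB per conversation (a minute of MP3)

raw = CONVERSATIONS * BYTES_PER_CONV   # 10**24 bytes of raw audio
catalog = CONVERSATIONS * 1000         # 1000 bytes of metadata each
catalog_index = catalog // 10          # index at 10% of the catalog
content_index = catalog                # content index at ~100% of catalog

total = raw + catalog + catalog_index + content_index
overhead_pct = 100 * (total - raw) / raw

print(f"raw data: {raw:.0e} bytes")
print(f"total:    {total:.3e} bytes ({overhead_pct:.2f}% overhead)")
```

The catalog and indexes, daunting as they are, add well under 1% on top of the raw 10**24 bytes; it's the conversations themselves that make it a yottabyte.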
Ingesting, indexing, and protecting 1 YB of storage all pose interesting challenges of their own, which I will leave for later posts.