A recent article from MIT’s Technology Review discussed cloud security (“Security in the Ether”). Most of the article was on how many cloud servers are vulnerable to a particular hack that can uncover private data in server memory/cache. But a good portion of the article was on how to secure data in the cloud and the article discussed a couple of new ideas (to me at least):
Securing cloud data access by using a key hierarchy – in this way a particular file/table/row could have a hierarchy of keys and thus, could have one master key for the whole datum and subset keys which would provide access to segments of the datum. As such, the patient could hold the master key to their electronic health records while their physicians held subset keys that would allow them to access diagnostic results and other information needed to treat the patient.
Securing cloud data search by encrypting metadata – in this way a search key could be encrypted and then the search could execute in the cloud against the encrypted metadata. As such, metadata and search keys would need to be encrypted in a static fashion so that they would always encrypt to the same ciphertext, but this could be done with an MD5 hash. Not sure how this might help sorting, but it’s certainly a step in the right direction, as searches could be performed completely securely while using cloud resources. Subsequent search results could then be easily delivered back to the end user for decryption and use.
Securing cloud data manipulation by using “ideal lattice” calculations on encrypted data – in this way mathematical manipulations of encrypted data are possible and the results can be extracted from the cloud for decryption and use. As such, data queries using arithmetic functions, such as summing a column of cloud data, can be completely secured and the resultant summation delivered outside the cloud. How this works is beyond me and the mathematics are said to be a bit cumbersome, but it’s still early and this may someday become a viable approach.
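To make the first two ideas a bit more concrete, here is a minimal sketch and not the scheme from the article. It assumes subset keys are derived one-way from a master key, and it swaps in a keyed HMAC-SHA256 for the plain MD5 hash mentioned above, since an unkeyed hash would let anyone who can guess a search term test that guess against the index. The function names, labels, document ids and the placeholder key are all hypothetical.

```python
import hashlib
import hmac


def derive_subkey(parent_key: bytes, label: str) -> bytes:
    """Derive a child key from a parent key (one-way, via HMAC-SHA256)."""
    return hmac.new(parent_key, label.encode("utf-8"), hashlib.sha256).digest()


def search_token(search_key: bytes, term: str) -> str:
    """Map a search term to a deterministic token: same term, same token."""
    normalized = term.strip().lower()
    return hmac.new(search_key, normalized.encode("utf-8"), hashlib.sha256).hexdigest()


# Placeholder master key held by the patient (a real one would be random).
master_key = bytes(range(32))

# Subset keys the patient could hand out; holders cannot recover master_key.
diagnostics_key = derive_subkey(master_key, "record/diagnostics")
billing_key = derive_subkey(master_key, "record/billing")
search_key = derive_subkey(master_key, "search")

# The cloud side stores only tokens, never the plaintext terms.
cloud_index = {
    search_token(search_key, "hypertension"): ["doc-17", "doc-42"],
    search_token(search_key, "asthma"): ["doc-08"],
}


def cloud_lookup(token: str) -> list:
    """What the cloud would run: an exact match on an opaque token."""
    return cloud_index.get(token, [])


hits = cloud_lookup(search_token(search_key, "Hypertension "))
print(hits)  # ['doc-17', 'doc-42']
```

Note that this only covers exact-match search. Because the tokens are effectively random, sorting or range queries over them would not work, which fits the doubt about sorting raised above.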
It seems to me most of this goes way beyond the data archive I would envision for the cloud. With such encryption techniques one could conceivably host one’s data center applications in the cloud and/or use the cloud to serve as data storage for all applications. While this may be the ultimate goal for the cloud it still seems a way off.
So what mathematical functions can be accomplished using an “ideal lattice”?
How will the NSA be able to retrieve anything in this amount of data?
The storage industry must come up with a new term that applies to 10**27 bytes of storage.
As a first stab at this I would suggest NONABYTE (nona- is Latin for nine, (y)otta- is Italian for eight). In a similar way, perhaps we could use DECEMABYTE for 10**30 and UNDECEMABYTE for 10**33. That should last us for a couple of years.
Storing a yottabyte of data is no small matter. 10 to 100 Petabytes (PB, 10**15 bytes) of data can be dealt with today by a number of storage systems, both cloud and non-cloud. Many cloud providers claim petabytes of storage in their environments, so this is entirely feasible today.
Exabytes (EB, 10**18 bytes) would seem to require an offline archive of data. Of course, somebody could conceivably build such an online storage complex (see below for how). Testing such a system might only be possible during implementation but that would not be unusual for such leading edge projects.
Zettabytes (ZB, 10**21 bytes) seem outside the realm of possibility today, being a million PB of storage. But offline archives could conceivably be built even for this amount of storage. It’s conceivable that online storage of an EB of data could be used to support offline storage of a ZB of data.
1 YB of data in perspective
Yottabytes of data seem extremely large. If a minute of standard definition digital video takes ~1 GB of storage, a yottabyte would be about 10**15 minutes of video.
A minute of MP3 audio (as in a phone conversation) takes roughly a MB of storage, so 1 YB would be about 10**18 minutes of conversation. Realize there are only ~6×10**9 people on the planet. So this is enough storage for over 100 million (10**8) minutes of conversation from everyone on the planet. Seems like a lot, but who am I to judge.
Also realize there are only ~5×10**5 minutes/year, so 10**24 bytes would be enough storage to record everything everybody said over ~333 years (1 MB/minute, or 10**6 bytes, × 6×10**9 people on earth × 5×10**5 minutes per year = 3×10**21 bytes required to store one year of everyone talking for the whole year). Also, people sleep, don’t often talk 100% of their waking time, and most conversations are between two people, so this is very conservative.
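For anyone who wants to check the back-of-envelope math, here is a small sketch of the same arithmetic. The inputs are the rough figures used above (1 MB per minute of audio, 6×10**9 people); the variable names are mine, and the exact minutes-per-year figure is included only to show how much the rounding matters.

```python
BYTES_PER_YB = 10**24
BYTES_PER_AUDIO_MINUTE = 10**6      # ~1 MB per minute of MP3 audio
PEOPLE = 6 * 10**9                  # rough world population used in the text
MINUTES_PER_YEAR_ROUNDED = 5 * 10**5
MINUTES_PER_YEAR_EXACT = 365 * 24 * 60  # 525,600

# Minutes of audio that fit in 1 YB, in total and per person.
total_minutes = BYTES_PER_YB // BYTES_PER_AUDIO_MINUTE
minutes_per_person = total_minutes / PEOPLE

# Years of everyone talking nonstop that 1 YB could hold.
bytes_per_year_rounded = BYTES_PER_AUDIO_MINUTE * PEOPLE * MINUTES_PER_YEAR_ROUNDED
bytes_per_year_exact = BYTES_PER_AUDIO_MINUTE * PEOPLE * MINUTES_PER_YEAR_EXACT
years_rounded = BYTES_PER_YB / bytes_per_year_rounded
years_exact = BYTES_PER_YB / bytes_per_year_exact

print(f"total minutes of audio: {total_minutes:.1e}")
print(f"minutes per person:     {minutes_per_person:.2e}")
print(f"years (rounded inputs): {years_rounded:.0f}")
print(f"years (exact minutes):  {years_exact:.0f}")
```

With the rounded inputs this comes out at about 333 years; with the exact 525,600 minutes per year it is closer to 317, so the conclusion doesn’t change.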
1 YB of data at rest
How to construct such a 1 YB archive poses many challenges. One would have to consider a multi-tier/level storage hierarchy made up of both removable and online storage.
Tape or other removable media would be an obvious choice for at least the lowest tier of storage, but keeping track of ~1.6×10**11 tape volumes (LTO-7 will maybe support 6.4 TB, or 6.4×10**12 bytes, per cartridge) seems outside today’s capabilities.
Similar quantities of disk drives would be required to store 1 YB of data but nobody would consider storing all this online. Consider that only 5.4×10**8 disk drives were shipped in 2008 and it becomes obvious that large portions of the 1 YB archive must be offline. Deduplication would help but audio and video don’t dedupe well.
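A quick sketch of the media counts, for what it’s worth. It takes the projected 6.4 TB LTO-7 cartridge as 6.4×10**12 bytes and the 2008 drive shipments from above; the 1 TB per disk drive is purely my assumption for illustration and is not from any vendor figure.

```python
BYTES_PER_YB = 10**24
LTO7_BYTES = 64 * 10**11            # 6.4 TB per cartridge (projected)
DISK_BYTES = 10**12                 # ASSUMPTION: 1 TB per drive
DRIVES_SHIPPED_2008 = 54 * 10**7    # 5.4×10**8 drives


def units_needed(total_bytes: int, unit_bytes: int) -> int:
    """Whole units of media needed to hold total_bytes (rounds up)."""
    return -(-total_bytes // unit_bytes)


tapes = units_needed(BYTES_PER_YB, LTO7_BYTES)
disks = units_needed(BYTES_PER_YB, DISK_BYTES)
years_of_shipments = disks / DRIVES_SHIPPED_2008

print(f"tape cartridges for 1 YB: {tapes:.2e}")
print(f"1 TB drives for 1 YB:     {disks:.2e}")
print(f"years of 2008-level drive shipments: {years_of_shipments:.0f}")
```

Under these assumptions, 1 YB on disk alone would use up well over a thousand years of 2008-level drive shipments, with no allowance for redundancy or spares.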
But that’s nothing, try keeping track of the 10**18 to 10**20 files (assuming 10**6 bytes per file for audio down to 10**4 bytes per file for text).
I think this calls for an object store of some type. 10**6 objects are feasible today. Scaling up to 10**18 through 10**20 would be a significant leap, but not outside the technology available 5 (or maybe 10) years hence.
Next one must consider the catalog for such a storage complex. Let’s assume these are conversations and use the 10**18 number. Keeping just 100 bytes of metadata per file, the catalog would take 10**20 bytes of storage. Of course, 100 bytes seems pretty limiting to record all the important data about a conversation or even a text file, so 1000 bytes seems more realistic. Thus, we would need 10**21 bytes of storage just for the catalog. It seems even portions of the catalog would need to be offline to be realistically stored. This would not be optimal but would accommodate a rudimentary listing of the 10**18 element catalog as a last resort.
Searching 1 YB of data
NSA would probably want at least to search the catalog for items of interest, like a person’s name, a phone number, or maybe even time of call. Indexes take anywhere from 20 to 100% of the data being searched. Let’s say with great people working on the project they can get the catalog index down to 10% of the storage being searched. So there is yet another 10**20 bytes of data for the catalog to be searchable. Now we would want the majority of this to be online and directly accessible but even this is 100,000 PB of data. Way beyond today’s capabilities for online accessible storage.
Of course, it’s possible that the agency might want to search the contents of the conversations for items of interest such as words used. Any content index would take vastly more storage than a simple catalog index but maybe this could be shrunk down to only 100% of the catalog size or 10**21 bytes of storage. Again, 1,000,000 PB of data is unlikely to be kept online in total.
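Putting the catalog and index estimates side by side, here is a rough sizing sketch. The ratios are the guesses from the text (1000 bytes of metadata per conversation, a catalog index at 10% of the catalog, a content index equal to the catalog); the function and its parameter names are made up for illustration.

```python
BYTES_PER_YB = 10**24
BYTES_PER_PB = 10**15


def size_catalog(files: int, metadata_bytes: int,
                 index_percent: int, content_index_percent: int) -> dict:
    """Return catalog, catalog-index and content-index sizes in bytes."""
    catalog = files * metadata_bytes
    return {
        "catalog": catalog,
        "catalog index": catalog * index_percent // 100,
        "content index": catalog * content_index_percent // 100,
    }


# 1 YB of ~1 MB conversations gives 10**18 files to catalog.
conversations = BYTES_PER_YB // 10**6
sizes = size_catalog(conversations, metadata_bytes=1000,
                     index_percent=10, content_index_percent=100)

for name, size in sizes.items():
    print(f"{name:14s} {size:.0e} bytes = {size // BYTES_PER_PB:,} PB")
print(f"{'total':14s} {sum(sizes.values()):.1e} bytes")
```

Changing `files` or the percentages makes it easy to see how quickly the metadata alone lands in zettabyte territory.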
I am beginning to see how NSA and Mitre may have come up with the YB figure. 10**20 for an index, 10**21 for a catalog, and another 10**21 for a vocabulary index to 10**18 conversations. Now YB of storage is starting to make sense. If you took the 10**18 conversations down say to 10**15, with a catalog of 10**18 bytes, indexes of 10**19 bytes this might be even more realistic. But, even 10**15 conversations seems a bit much for 2015.
Ingesting, indexing, and protecting 1 YB of storage all pose interesting challenges of their own which I will leave for later posts.