We talked with Hitachi Vantara this past week at a special Tech Field Day extra event (see videos here). This was an all-day affair with a broad discussion of Hitachi’s infrastructure portfolio.
Hitachi has a number of offerings surrounding their content platform, including:
HCP, on-premises object store,
HCP Anywhere, enterprise file sync and share using HCP,
HCP Content Intelligence, compliance and content search for HCP object storage, and
HCP Data Ingestor, file gateway to HCP object storage.
I already knew about these offerings but had no idea how successful HCP has been over the years. According to Hitachi Vantara, HCP has over 4000 installations worldwide with over 2000 customers and is currently the number 1 on-premises object storage solution in the world.
For instance, HCP is installed in 4 out of the 5 largest banks, insurance companies, and TelCos worldwide. HCP Anywhere has over a million users with over 15K in Hitachi alone. Hitachi Vantara has some customers using HCP installations that support 4000-5000 object ingests/sec.
HCP software supports geographically dispersed erasure coding, data compression, deduplication, and encryption of customer object data.
The HCP development team has transitioned to micro services/container-based applications and has developed its Foundry Framework to make this easier. I believe the intent is to ultimately redevelop all HCP solutions using Foundry.
Hitachi mentioned a couple of customers:
US Government National Archives, which uses HCP behind Pentaho to preserve presidential data and metadata for 100 years, and uses all open APIs to do so,
UK Rabo Bank, which uses HCP to support compliance monitoring across a number of data feeds, and
US Ground Support, which uses Pentaho, HCP, HCP Content Intelligence and HCP Anywhere to support geospatial search to ascertain boats at sea and what they are doing/shipping.
There’s a lot more to HCP and Hitachi Vantara than summarized here and I would suggest viewing the TFD videos and checking out the link above for more information.
Comments?
Want to learn more? See these other TFD bloggers’ posts:
How will the NSA be able to retrieve anything in this amount of data?
The storage industry must come up with a new term that applies to 10**27 bytes of storage.
As a first stab at this I would suggest NONABYTE (nona- is Latin for nine, (y)otta- is Italian for eight). In a similar way, perhaps we could use DECEMABYTE for 10**30 and UNDECEMABYTE for 10**33. That should last us for a couple of years.
Storing a yottabyte of data is no small matter. 10 to 100 Petabytes (PB, 10**15 bytes) of data can be dealt with today with a number of storage systems both cloud and non-cloud. Many cloud providers claim PB of storage under their environment so this is entirely feasible today.
Exabytes (XB, 10**18 bytes) would seem to require an offline archive of data. Of course, somebody could conceivably build such an online storage complex (see below for how). Testing such a system might only be possible during implementation but that would not be unusual for such leading edge projects.
Zettabytes (ZB, 10**21 bytes) seem outside the realm of possibility today, being a million PB of storage. But offline archives could conceivably be built even for this amount of storage. It’s conceivable that online storage of an XB of data could be used to support offline storage of a ZB of data.
1 YB of data in perspective
Yottabytes of data seem extremely large. If a minute of standard definition digital video takes roughly a GB (10**9 bytes) of storage, a yottabyte would be about 10**15 minutes of video.
A minute of MP3 audio (as in a phone conversation) takes roughly a MB of storage, so 1 YB would be about 10**18 minutes of conversation. Realize there are only ~6×10**9 people on the planet. So this is enough storage for over 100 million (10**8) minutes of conversation from everyone on the planet. Seems like a lot, but who am I to judge.
Also realize there are only ~5×10**5 minutes/year, so 10**24 bytes would be enough storage to record everything everybody said for ~333 years (10**6 bytes/minute × 6×10**9 people on earth × 5×10**5 minutes per year = 3×10**21 bytes required to store one year of everyone talking for the whole year). Also, people sleep, don’t talk 100% of their waking time, and most conversations are between two people, so this is very conservative.
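For anyone who wants to check the arithmetic above, here is a minimal Python sketch. The constant names are mine, and the per-minute sizes are the rough figures used in this post (about a GB per minute of video, about a MB per minute of audio), not measured values.

```python
# Back-of-envelope check of the "1 YB in perspective" figures.
# All inputs are the rounded estimates from the post, not measurements.
YB = 10**24
GB = 10**9
MB = 10**6

PEOPLE = 6 * 10**9              # rough world population used in the post
MINUTES_PER_YEAR = 5 * 10**5    # rounded down from 525,600

video_minutes = YB // GB        # minutes of SD video in 1 YB
audio_minutes = YB // MB        # minutes of MP3 audio in 1 YB
audio_minutes_per_person = audio_minutes / PEOPLE

# Bytes needed to record everyone talking nonstop for one year.
bytes_per_year_everyone = MB * PEOPLE * MINUTES_PER_YEAR
years_of_everyone_talking = YB / bytes_per_year_everyone

print(f"video minutes:            {video_minutes:.2e}")
print(f"audio minutes:            {audio_minutes:.2e}")
print(f"audio minutes per person: {audio_minutes_per_person:.2e}")
print(f"bytes/year for everyone:  {bytes_per_year_everyone:.2e}")
print(f"years of talk in 1 YB:    {years_of_everyone_talking:.0f}")
```

Changing PEOPLE or the per-minute sizes shows how sensitive the ~333 year figure is to these guesses.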
1 YB of data at rest
How to construct such a 1 YB archive poses many challenges. One would have to consider a multi-tier/level storage hierarchy made up of both removable and online storage.
Tape or other removable media would be an obvious choice for at least the lowest tier of storage but keeping track of ~1.6×10**11 tape volumes (LTO-7 will maybe support 6.4TB, or 6.4×10**12 bytes, per cartridge) seems outside today’s capabilities.
Similar quantities of disk drives would be required to store 1 YB of data but nobody would consider storing all this online. Consider that only 5.4×10**8 disk drives were shipped in 2008 and it becomes obvious that large portions of the 1YB archive must be offline. Deduplication would help but audio and video don’t dedupe well.
But that’s nothing, try keeping track of the 10**18 to 10**20 files (assuming 10**6 bytes per file for audio down to 10**4 bytes per file for text).
I think this calls for an object store of some type. 10**6 objects are feasible today. Scaling up to 10**18 through 10**20 objects would be a significant leap but not outside technology available 5 (or maybe 10) years hence.
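A quick sketch of the counting involved, assuming a TB is 10**12 bytes and treating the 6.4TB LTO-7 capacity mentioned above as a rough projection. The variable names are illustrative only.

```python
# How many cartridges and files does 1 YB imply?
YB = 10**24
TB = 10**12                      # assumption: decimal terabyte

LTO7_BYTES = 64 * TB // 10       # 6.4 TB per cartridge (projected figure)
tape_volumes = YB // LTO7_BYTES  # cartridges needed, ignoring compression

AUDIO_FILE_BYTES = 10**6         # ~1 MB per audio file
TEXT_FILE_BYTES = 10**4          # ~10 KB per text file
audio_files = YB // AUDIO_FILE_BYTES
text_files = YB // TEXT_FILE_BYTES

print(f"tape volumes: {tape_volumes:.2e}")
print(f"audio files:  {audio_files:.0e}")
print(f"text files:   {text_files:.0e}")
```

With a TB taken as 10**12 bytes this works out to roughly 1.6×10**11 cartridges, and between 10**18 and 10**20 files depending on the mix of audio and text.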
Next one must consider the catalog for such a storage complex. Let’s assume these are conversations and use the 10**18 number. Keeping just 100 bytes of metadata per file, the catalog would take 10**20 bytes of storage. Of course, 100 bytes seems pretty limiting to record all the important data about a conversation or even a text file, so 1000 bytes seems more realistic. Thus, we would need 10**21 bytes of storage just for the catalog. It seems even portions of the catalog would need to be offline to be realistically stored. This would not be optimal but would accommodate a rudimentary listing of the 10**18-element catalog as a last resort.
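The catalog sizing is simple multiplication, sketched below under the same assumptions (10**18 conversations, 100 or 1000 bytes of metadata each). The function name catalog_bytes is just something I made up for illustration.

```python
# Catalog size = number of entries x metadata bytes per entry.
PB = 10**15
CONVERSATIONS = 10**18

def catalog_bytes(entries: int, metadata_bytes_per_entry: int) -> int:
    """Total bytes needed to hold one fixed-size record per entry."""
    return entries * metadata_bytes_per_entry

lean_catalog = catalog_bytes(CONVERSATIONS, 100)        # bare-minimum metadata
realistic_catalog = catalog_bytes(CONVERSATIONS, 1000)  # more realistic metadata

print(f"lean catalog:      {lean_catalog // PB:,} PB")
print(f"realistic catalog: {realistic_catalog // PB:,} PB")
```

Either way the catalog alone lands in the hundreds of thousands of PB, which is why some of it would likely have to live offline.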
Searching 1 YB of data
NSA would probably want at least to search the catalog for items of interest, like a person’s name, a phone number, or maybe even time of call. Indexes take anywhere from 20 to 100% of the size of the data being searched. Let’s say that with great people working on the project they can get the catalog index down to 10% of the storage being searched. So there is yet another 10**20 bytes of data to make the catalog searchable. Now we would want the majority of this to be online and directly accessible but even this is 100,000 PB of data, way beyond today’s capabilities for online accessible storage.
Of course, it’s possible that the agency might want to search the contents of the conversation for items of interest such as words used. Any content index would take vastly more storage than a simple catalog index but maybe this could be shrunk down to only 100% of the catalog size or 10**21 bytes of storage. Again, 1,000,000 PB of data is unlikely to be kept online in total.
I am beginning to see how NSA and Mitre may have come up with the YB figure: 10**20 for an index, 10**21 for a catalog, and another 10**21 for a vocabulary index to 10**18 conversations. Now a YB of storage is starting to make sense. If you took the 10**18 conversations down to, say, 10**15, with a catalog of 10**18 bytes and indexes of 10**19 bytes, this might be even more realistic. But even 10**15 conversations seems a bit much for 2015.
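The two index estimates can be sketched the same way, assuming the 10**21 byte catalog (10**18 conversations at 1000 bytes each) and the 10% and 100% ratios guessed at above. The helper index_bytes is hypothetical.

```python
# Index size as a percentage of the catalog being indexed.
PB = 10**15
CATALOG_BYTES = 10**21   # 10**18 conversations x 1000 bytes of metadata

def index_bytes(data_bytes: int, percent_of_data: int) -> int:
    """Index size when the index is percent_of_data % of the data size."""
    return data_bytes * percent_of_data // 100

catalog_index = index_bytes(CATALOG_BYTES, 10)    # optimistic 10% catalog index
content_index = index_bytes(CATALOG_BYTES, 100)   # content index at 100% of catalog

print(f"catalog index: {catalog_index // PB:,} PB")
print(f"content index: {content_index // PB:,} PB")
print(f"catalog + both indexes: {(CATALOG_BYTES + catalog_index + content_index) // PB:,} PB")
```

Swapping in the 20 to 100% range for a typical index shows how quickly the searchable overhead grows if the 10% target is missed.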
Ingesting, indexing, and protecting 1 YB of storage all pose interesting challenges of their own which I will leave for later posts.