StorageTek Automated Cartridge System by brewbooks (cc) (from Flickr)
I was on a conference call today with Oracle’s marketing team discussing their tape business. Fred Moore (from Horison Information Systems) was on the call and mentioned something that surprised me: what’s missing in open and distributed systems is a standalone mechanism to stack multiple volumes onto a single tape cartridge.
The advantages of tape are significant, namely:
Low power utilization for offline or nearline storage
Cheap media, drives, and automation systems
Good sequential throughput
Good cartridge density
But most of these advantages fade when cartridge capacity utilization drops. One way to increase cartridge capacity utilization is to stack multiple tape volumes on a single cartridge.
Mainframes (like System z) have had cartridge stacking since the late ’90s. Such capabilities came about because of the increasing cartridge capacities then available. Advance a decade and the problem still exists: Oracle’s StorageTek T10000 has a 1TB cartridge capacity and LTO-5 supports 1.5TB per cartridge, both uncompressed. Nonetheless, open and distributed systems still have no tape stacking capability.
Although I agree with Fred that volume stacking is missing in open systems, do they really need such a thing? Currently, open systems seem to use tape for backups, archive data, and the occasional batch run. Automated hierarchical storage management can readily fill up tape cartridges by holding data movement to tape until enough data is ready to be moved. On the other hand, backups by their very nature create large sequential streams of data, which should result in high capacity utilization except for the last tape in a series. Which leaves only the problem of occasional batch runs using large datasets or files.
I believe most batch processing today already takes place on the mainframe, leaving relatively little for open or distributed systems. There are certainly some verticals that do lots of batch processing, for example banks and telcos. But most heavy batch users grew up in the heyday of the mainframe and are still using it today.
Condor notwithstanding, open and distributed systems never had the sophisticated batch processing capabilities readily available on the mainframe. As such, my guess is that new companies needing batch processing start on open systems and, as their batch needs grow, move those applications to the mainframe.
So the real question becomes: how do we increase open systems batch processing? I don’t think a tape volume stacking system solves that problem.
Given all the above, I see tape use in open being relegated to backup and archive and used less and less for any other activities.
I saw a recent IEEE Spectrum article on engineering’s grand challenges for the next century and thought something similar should be done for data storage. So this is a start:
Replace magnetic storage – most predictions show that magnetic disk storage has another 25 years and magnetic tape another decade after that before they run out of steam. Such end-dates have been wrong before, but it is unlikely that we will be using disk or tape 50 years from now. Some sort of solid state device seems the most probable next evolution of storage. I doubt this will be NAND, considering its write endurance and other long-term reliability issues, but if such issues could be resolved maybe it could replace magnetic storage.
1000 year storage – paper can be printed today with non-acidic ink and retain its image for over 1,000 years. Nothing in data storage today can claim much more than 100-year longevity. The world needs data storage that lasts much longer than 100 years.
Zero energy storage – today SSD/NAND and rotating magnetic media consume energy constantly in order to be accessible. Ultimately, the world needs some sort of storage that consumes energy only when read or written; such storage would provide “online access with offline power consumption”.
Convergent fabrics running divergent protocols – whether it’s Ethernet, InfiniBand, FC, or something new, all fabrics should be able to handle any and all storage (and datacenter) protocols. The internet has become so ubiquitous because it handles just about any protocol we throw at it. We need the same or something similar for datacenter fabrics.
Securing data – securing books or paper is relatively straightforward today: just throw them in a vault or safety deposit box. Securing data seems simple, yet it is not widely practiced today. It doesn’t have to be that way. We need better, longer-lasting tools and methodologies to secure our data.
Public data repositories – libraries exist to provide access to the output of society in the form of books, magazines, papers, and other printed artifacts. No such repository exists today for data. Society would be better served if there were library-like institutions that could store and provide access to data. Most of the issues here are legal, due to data ownership, but technological issues exist as well.
Associative accessed storage – sequential and random access have been around for over half a century now. Associative storage could complement these as another approach, allowing data to be retrieved by its content. We can kind of do this today by keywording and indexing data. Biological memory is accessed via associations or linkages to other concepts; once accessed, memory seems almost sequentially accessed from there. Something comparable to biological memory may be required to build more intelligent machines.
Some of these are already being pursued and yet others receive no interest today. Nonetheless, I believe they all deserve investigation, if storage is to continue to serve its primary role to society, as a long term storehouse for society’s culture, thoughts and deeds.
How will the NSA be able to retrieve anything in this amount of data?
The storage industry must come up with a new term that applies to 10**27 bytes of storage.
As a first stab at this I would suggest NONABYTE (nona- is Latin for nine, (y)otta- is Italian for eight). In a similar way, perhaps we could use DECEMABYTE for 10**30 and UNDECEMABYTE for 10**33. That should last us for a couple of years.
Storing a yottabyte of data is no small matter. 10 to 100 petabytes (PB, 10**15 bytes) of data can be dealt with today by a number of storage systems, both cloud and non-cloud. Many cloud providers claim PBs of storage under their environments, so this is entirely feasible today.
Exabytes (EB, 10**18 bytes) would seem to require an offline archive of data. Of course, somebody could conceivably build such an online storage complex (see below for how). Testing such a system might only be possible during implementation, but that would not be unusual for such leading edge projects.
Zettabytes (ZB, 10**21 bytes) seem outside the realm of possibility today, being a million PB of storage. But offline archives could conceivably be built even for this amount of storage. It’s conceivable that online storage of an EB of data could be used to support offline storage of a ZB of data.
1 YB of data in perspective
Yottabytes of data seem extremely large. If a minute of standard definition digital video takes ~1GB of storage, a yottabyte would be about 10**15 minutes of video.
A minute of MP3 audio (as in a phone conversation) takes roughly 1MB of storage, so 1 YB would be about 10**18 minutes of conversation. Realize there are only ~6×10**9 people on the planet, so this is enough storage for ~100 million (10**8) minutes of conversation from everyone on the planet. Seems like a lot, but who am I to judge.
Also realize there are only ~5×10**5 minutes/year, so 10**24 bytes would be enough storage to record everything everybody said for ~333 years (10**6 bytes/minute × 6×10**9 people on earth × 5×10**5 minutes per year = 3×10**21 bytes to store one year of everyone talking the whole year). Also, people sleep, don’t talk 100% of their waking time, and most conversations are between two people, so this is very conservative.
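For anyone who wants to replay that arithmetic, here is a minimal Python sketch. All the constants are the rough, assumed figures from above (bytes per minute of video and audio, world population, minutes per year), not measured values.

YB = 10**24                       # one yottabyte, in bytes
video_bytes_per_min = 10**9       # assumed ~1GB per minute of SD video
audio_bytes_per_min = 10**6       # assumed ~1MB per minute of MP3 audio
people = 6 * 10**9                # rough world population
minutes_per_year = 5 * 10**5      # ~525,600 minutes per year, rounded

print(YB // video_bytes_per_min)             # ~10**15 minutes of video
print(YB // audio_bytes_per_min)             # ~10**18 minutes of audio
print(YB // (audio_bytes_per_min * people))  # ~1.7x10**8 minutes of talk per person
one_year_everyone = audio_bytes_per_min * people * minutes_per_year
print(one_year_everyone)                     # 3x10**21 bytes to record everyone for a year
print(YB // one_year_everyone)               # ~333 years of everyone talking nonstop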
1 YB of data at rest
How to construct such a 1 YB archive poses many challenges. One would have to consider a multi-tier/level storage hierarchy made up of both removable and online storage.
Tape or other removable media would be an obvious choice for at least the lowest tier of storage, but keeping track of ~1.6×10**11 tape volumes (LTO-7 may support 6.4TB, or 6.4×10**12 bytes, per cartridge) seems outside today’s capabilities.
Similar quantities of disk drives would be required to store 1 YB of data, but nobody would consider storing all this online. Consider that only 5.4×10**8 disk drives were shipped in 2008 and it becomes obvious that large portions of the 1 YB archive must be offline. Deduplication would help, but audio and video don’t dedupe well.
But that’s nothing; try keeping track of the 10**18 to 10**20 files (assuming 10**6 bytes per file for audio down to 10**4 bytes per file for text).
I think this calls for an object store of some type. 10**6 objects are feasible today; scaling up to 10**18 through 10**20 would be a significant leap, but perhaps not outside technology available 5 (or maybe 10) years hence.
Next one must consider the catalog for such a storage complex. Let’s assume these are conversations and use the 10**18 number; keeping just 100 bytes of metadata per file, the catalog would take 10**20 bytes of storage. Of course, 100 bytes seems pretty limiting to record all the important data about a conversation or even a text file, so 1,000 bytes seems more realistic. Thus, we would need 10**21 bytes of storage just for the catalog. It seems even portions of the catalog would need to be offline to be realistically stored. This would not be optimal, but it would accommodate a rudimentary listing of the 10**18-element catalog as a last resort.
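Here is a similarly hedged Python sketch of the data-at-rest sizing; the cartridge capacity, file sizes, and per-file metadata sizes are the assumed figures above, not vendor specifications.

YB = 10**24                      # one yottabyte, in bytes
cartridge = int(6.4 * 10**12)    # assumed ~6.4TB per LTO-7 cartridge
print(YB // cartridge)           # ~1.6x10**11 cartridges to hold 1 YB on tape

audio_file = 10**6               # assumed ~1MB per conversation/audio file
text_file = 10**4                # assumed ~10KB per text file
print(YB // audio_file)          # 10**18 files if everything is audio
print(YB // text_file)           # 10**20 files if everything is text

files = 10**18                   # use the conversation (audio) figure
print(files * 100)               # 10**20 bytes of catalog at 100 bytes/file
print(files * 1000)              # 10**21 bytes of catalog at 1,000 bytes/file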
Searching 1 YB of data
The NSA would probably want at least to search the catalog for items of interest, like a person’s name, a phone number, or maybe even the time of a call. Indexes take anywhere from 20 to 100% of the data being searched. Let’s say, with great people working on the project, they can get the catalog index down to 10% of the storage being searched. So that’s yet another 10**20 bytes of data to make the catalog searchable. We would want the majority of this to be online and directly accessible, but even this is 100,000 PB of data, way beyond today’s capabilities for online accessible storage.
Of course, it’s possible that the agency might want to search the contents of the conversations for items of interest, such as words used. Any content index would take vastly more storage than a simple catalog index, but maybe this could be shrunk down to only 100% of the catalog size, or 10**21 bytes of storage. Again, 1,000,000 PB of data is unlikely to be kept online in total.
I am beginning to see how NSA and Mitre may have come up with the YB figure: 10**20 for an index, 10**21 for a catalog, and another 10**21 for a vocabulary index, all on top of the 10**18 conversations themselves (~10**24 bytes). Now a YB of storage is starting to make sense. If you took the 10**18 conversations down to, say, 10**15, with a catalog of 10**18 bytes and indexes of 10**19 bytes, this might be even more realistic. But even 10**15 conversations seems a bit much for 2015.
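Adding it all up, again with the assumed figures from above, shows why the raw conversations dominate the total and why it lands at roughly a yottabyte.

conversations = 10**18
raw_audio     = conversations * 10**6   # ~10**24 bytes, the conversations themselves
catalog       = conversations * 1000    # ~10**21 bytes of catalog metadata
catalog_index = catalog // 10           # ~10**20 bytes, at 10% of the catalog
content_index = catalog                 # ~10**21 bytes, assumed ~100% of catalog size

total = raw_audio + catalog + catalog_index + content_index
print(f"{total:.3e} bytes")             # ~1.002e+24 bytes, i.e. roughly one yottabyte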
Ingesting, indexing, and protecting 1 YB of storage all pose interesting challenges of their own which I will leave for later posts.
HP LTO 4 Tape Media
In my past life, I worked for a dominant tape vendor. Over the years, we had heard a number of times that tape was dead. But it never happened. BTW, it’s also not happening today.
Just a couple of weeks ago, I was at SNW and a vendor friend of mine asked if I knew anyone with tape library expertise because they were bidding on more and more tape archive opportunities. Tape seems alive and kicking from what I can see.
However, the fact is that tape use is being repositioned. Tape is no longer the direct target for backups that it once was. Most backup packages nowadays backup to disk and then later, if at all, migrate this data to tape (D2D2T). Tape is being relegated to a third tier of storage, a long-term archive and/or a long term backup repository.
The economics of tape are not hard to understand. You pay for robotics, media, and drives. Tape, just like any removable media, requires no additional power once it’s removed from the transport/drive used to write it. Removable media can be transported to an offsite repository or across the continent. There it can await recall with nary a watt of power consumed.
Problems with tape
So what’s wrong with tape? Why aren’t more shops using it? Let me count the problems:
Tape, without robotics, requires manual intervention
Tape, because of its transportability, can be lost or stolen, leading to data security breaches
Tape processing, in general, is more error prone than disk. Tape can have media and drive errors which cause data transfer operations to fail
Tape is accessed sequentially; it cannot be randomly accessed (quickly), and only one stream of data can be accepted per drive
Much of a tape volume is wasted, never written space
Tape technology doesn’t stay around forever, eventually causing data obsolescence
Tape media doesn’t last forever, causing media loss and potentially data loss
There are likely some other issues with tape missed here, but these seem the major ones from my perspective.
It’s no surprise that most of these problems are addressed or mitigated in one form or another by the major tape vendors, software suppliers and others interested in continuing tape technology.
Robotics can answer the manual intervention problem, if you can afford it. Tape encryption deals effectively with stolen tapes, but requires key management somewhere. Many applications exist today to help predict when media will go bad or transports need servicing. Tape data is, and always will be, accessed sequentially, but then so is lots of other data in today’s IT shops. Tape transports are most definitely single threaded, but sophisticated applications can intersperse multiple streams of data onto a single tape. Tape volume stacking is old technology, not necessarily easy to deploy outside of some sort of VTL front-end, but it is available. Drive and media technology obsolescence will never go away, but it indicates a healthy tape marketplace.
Future of tape
Say what you will about Ultrium or Linear Tape-Open (LTO) technology, made up of the HP, IBM, and Quantum research partners, but it has solidified/consolidated mid-range tape technology. Is it as advanced as it could be, or pushing to open new markets? Probably not. But they are advancing tape technology, providing higher capacity, higher performance, and more functionality with each recent generation. And they have not stopped: Ultrium’s roadmap shows LTO-6 right after LTO-5, and delivery of LTO-5, a 1.6TB uncompressed capacity tape, is right around the corner.
Also IBM and Sun continue to advance their own proprietary tape technology. Yes, some groups have moved away from their own tape formats but that’s alright and reflects the repositioning that’s happening in the tape marketplace.
As for the future, I was at an IEEE magnetics meeting a couple of years back and the leader said that tape technology was always a decade behind disk technology. So the disk recording heads/media in use today will likely see some application to tape technology in about 10 years. As such, as long as disk technology advances, tape will come out with similar capabilities sometime later.
Still, it’s somewhat surprising that tape is able to provide so much volumetric density with decade-old disk technology, but that’s the way tape works. Packing a ribbon of media around a hub can provide a lot more volumetric storage density than a platter of media using similar recording technology.
In the end, tape has a future to exploit if vendors continue to push its technology. As long-term archive storage, it’s hard to beat its economics. As a backup target it may be less viable. Nonetheless, it still has a significant installed base which turns over very slowly, given the sunk costs in media, drives, and robotics.
Full disclosure: I have no active contracts with LTO or any of the other tape groups mentioned in this post.
There was a time not long ago when the title of this post wouldn’t have included SSD. But, with the history of the last couple of years, SSD has earned its right to be included.
A couple of years back I was at a Rocky Mountain Magnetics Seminar (see IEEE magnetics societies) and a disk drive technologist stated that Disks have about another 25 years of technology roadmap ahead of them. During this time they will continue to increase density, throughput and other performance metrics. After 25 years of this they will run up against some theoretical limits which will halt further density progress.
At the same seminar, the presenter said that Tape was lagging Disk technology by about 5-10 years or so. As such, tape should continue to advance for another 5-10 years after disk stops improving at which time tape would also stop increasing density.
Does all this mean the end of tape and disk? I think not. Paper arguably stopped advancing in density some 2,000 to 3,000 years ago (the papyrus scroll was the ultimate in paper “rotating media”). If we move up to the codex or book form, which in my view is a form factor advance, this took place around 400 AD (see history of scroll and codex). The paperback, another form factor advance, came in the early 20th century (see paperback history).
Turning now to write performance, movable type was a significant paper (write) performance improvement, starting in the mid-15th century. The printing press would go on to improve (paper write) performance for the next six centuries (see printing press history) and continues to do so today.
All this indicates that a data technology whose density was capped over 2,000 years ago can continue to advance and support valuable activity in today’s world and for the foreseeable future. “Will disk and tape go away?” is the wrong question; the right question is “can disk or tape, after SSDs reach price equivalence on a $/GB basis, still be useful to the world?”
I think yes, but that depends on a number of factors as to how the relative SSD, disk, and tape technologies advance. Assuming someday all these technologies support equivalent Tb/SqIn or spatial density, and
SSDs retain their relative advantage in random access speed,
Tape its advantage in sequential throughput, volumetric density, and long media life, and
Disk its all-around, combined sequential and random access advantage,
It seems likely that each can sustain some niche in the data center/home office of tomorrow, although probably not where they are today.
One can see trends being enacted in enterprise data centers today that are altering the relative positioning of SSDs, disks, and tape. Tape is now being relegated to long-term, archive storage; disk is moving to medium-term, secondary storage; and SSDs are replacing top-tier, primary storage.