The future of libraries

Vista de la Biblioteca Vasconcelos by Eneas (cc) (from flickr)
My recent post on an exabyte-a-day generated a comment that got me thinking. What we need in the world today is a universal deduped archive. Such an archive would be a repository for all information generated by the world, nation, state, etc. and would automatically deduplicate the data and back it up.

Such an archive could be a new form of the library, keeping data both for future generations and for a nation's current population. Data held in the library repository would need to have:

  • Iron-clad data security via some form of data-at-rest encryption. This is a bit tricky since we would want to dedupe all the data from everywhere yet at the same time have the data be encrypted.
  • Enforceable digital rights management that would allow authorized users to access data while restricting unauthorized users from viewing it.
  • Easy accessibility that would allow home consumers access to their data in an “always on” type of environment or access from any internet enabled location.
  • Dependable backups that would allow user restore of data.
  • A time-limited protection scheme whereby, after so many years (say 60 or 100) of non-access/non-modification, the data would revert to public, non-secured access for future research.
  • Government funding akin to today's libraries, which are publicly funded but serve those consumers who take the time to use their facilities.

I see this as another outgrowth of the current library, which serves as a repository for today's books, magazines, media, maps, and other published artifacts. However, in this case most data would not be published during a person's lifetime but would become public property sometime after that person dies.

Benefits to society and the individual

Of what use could such a data repository be? Once the data becomes publicly accessible:

  • Future historians could find out what life was really like, in a detail never before available. Find out what people were watching/listening to, who people wrote to/conversed with, and what people cared about in the 21st century by perusing the data feeds of that generation.
  • Future scientists could mine the data for insights into a generation, network links, and personal data consumption.
  • Future governments could mine the data looking for what people thought about a nation, its economy, politics, etc., to help create better government.

But mostly, we don't know what future researchers could do with the data. If such a repository existed today for what people were thinking and doing 60 to 100 years ago, history would be much more person-derived than media-derived. Economists would have a much more accurate picture of the Great Depression's effect on humankind. Medicine would have a much better picture of how the pollutants and lifestyles of yesterday impact the health of today.

Also, as more and more of society's activities involve data, the detail available on a person's life becomes even more pervasive. Consider medical imaging: if you had a repository of a person's x-rays from birth to death, that data could prove invaluable to the medicine of tomorrow.

While the data is still protected, people:

  • Would have a secure repository to store all their data, accessible from any internet enabled location
  • Would have an unlimited repository for their data storage, not unlike Time Machine on the Mac, which they could go back to at any time to retrieve past data.
  • Would have the potential to record even more information about their daily activities.
  • Would have a way to license their data feeds to researchers for a price sort of like registering for Nielsen TV or Alexa web tracking.

Costs to society

The price society would pay could be minimized by appropriate storage and systems technology. If the data created by individuals (~87PB/day from the above mentioned post) could be deduped by a factor of 50X, that would leave only about 1.7PB of unique data per day worldwide. If I take a nation's portion of world GDP as a surrogate for the data created by that nation, then the US, with 23.6% of the world's '08 GDP, creates ~0.4PB of individual deduped data per day, or ~150PB of data per year.
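A quick back-of-the-envelope sketch of that arithmetic (the 87PB/day, 50X dedupe factor, and 23.6% GDP share are the figures cited above; the rest is simple scaling):

```python
# Back-of-the-envelope sizing for a national deduped archive.
# Inputs are the figures cited in the post; the rest is arithmetic.

WORLD_DATA_PB_PER_DAY = 87.0   # individual data created worldwide (from the exabyte-a-day post)
DEDUPE_RATIO = 50.0            # assumed worldwide deduplication factor
US_SHARE_OF_GDP = 0.236        # US share of 2008 world GDP, used as a proxy for data share

unique_pb_per_day = WORLD_DATA_PB_PER_DAY / DEDUPE_RATIO      # ~1.7 PB/day worldwide
us_pb_per_day = unique_pb_per_day * US_SHARE_OF_GDP           # ~0.4 PB/day for the US
us_pb_per_year = us_pb_per_day * 365                          # ~150 PB/year for the US

print(f"Unique data worldwide: {unique_pb_per_day:.1f} PB/day")
print(f"US deduped data:       {us_pb_per_day:.2f} PB/day, {us_pb_per_year:.0f} PB/year")
```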

Of course this would be split up by state or by municipality, so the load on any one jurisdiction would be considerably smaller than this. But storing 150PB of data today would take 75,000 2TB drives and would cost about ~$15.8M in drive costs (a 2TB WD drive costs $210 on Amazon) in the US. This does not account for servers, backups, power, cooling, floorspace, administration, etc., but let's triple the number to incorporate these other costs. So to store all the data created by individuals in the US in 2009 would cost around $47.4M with today's technology.
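And the drive-count and cost arithmetic, under the same assumptions (rounding the drive cost up to $15.8M before tripling is what yields the $47.4M above):

```python
# Rough storage cost for one year of US individual (deduped) data, using the
# 2TB/$210 drive price cited above and a 3X multiplier for servers, backup,
# power, cooling, floorspace, and administration.

US_PB_PER_YEAR = 150.0
DRIVE_TB = 2.0
DRIVE_COST_USD = 210.0
OVERHEAD_MULTIPLIER = 3.0   # assumption from the post: triple drive cost to cover everything else

drives_needed = US_PB_PER_YEAR * 1000 / DRIVE_TB    # 75,000 drives
drive_cost = drives_needed * DRIVE_COST_USD         # ~$15.8M
total_cost = drive_cost * OVERHEAD_MULTIPLIER       # ~$47M all-in

print(f"Drives: {drives_needed:,.0f}, drive cost: ${drive_cost/1e6:.1f}M, total: ${total_cost/1e6:.1f}M")
```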

Also consider that this cost is being cut in half every 18 to 24 months, but counteracting that trend is significant growth (~50% per year) in the data created/stored by individuals. Hence, by my calculations, the cost to store all this data declines slightly every year, depending on the speed of density increase and the average individual data growth rate.
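A rough sketch of how those two trends net out, assuming the ~50% annual data growth above and a cost-per-GB halving every 18 or 24 months; depending on which halving rate you assume, the yearly bill drifts slightly down or slightly up:

```python
# Net year-over-year change in the cost of storing individuals' data, assuming
# cost/GB halves every N months while data created grows ~50% per year.
# Both assumptions come from the text above, not from measured figures.

GROWTH_PER_YEAR = 1.50   # ~50% annual growth in individual data created/stored

for halving_months in (18, 24):
    price_factor = 0.5 ** (12.0 / halving_months)   # annual decline in $/GB
    cost_factor = GROWTH_PER_YEAR * price_factor    # net change in the annual bill
    print(f"$/GB halves every {halving_months} months -> yearly cost change: {cost_factor - 1:+.0%}")
```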

In any event, $47.4M is not a lot to spend to keep a nation’s worth of individual data. The benefits to today’s society would be considerable and future generations would have a treasure trove of data to analyze whenever the need presented itself.

Holding this back today is the obvious cost, but also all of the data security considerations. I believe the costs are manageable, at least at the state or municipal level. As for the data security considerations, simple data-at-rest encryption is one viable solution, although how to encrypt while still providing deduplication is a serious problem to be overcome. Enforceable digital rights, time-limited protection, and the other technological features could come with time.

An Exabyte-a-day

snp microarray data by mararie (cc) (from flickr)

At HPTechDay this week, Jim Pownell, office of the CTO, HP StorageWorks Division, reported on an IDC study that said the world is creating about an exabyte of data each day this year. An exabyte (XB) is 10**18 bytes, or 1000PB, of data. That seems a bit high from my perspective.

Data creation by individuals

Population Growth and Income Level Chart by mattlemmon (cc) (from flickr)

The US Census Bureau estimates today's worldwide population at around 6.8 billion people. Given that estimate, the XB/day number says the average person is creating about 150MB/day.

Now I don't know about you, but we probably create that much data only during our best week. That being said, our family's average over the last 3.5 years is more like 30.1MB/day. Over the last year, that average has been closer to 75.1MB/day (darn new digital camera).

If I take our 75.1MB/day as a reasonable approximation of a family average, then with 2 adults in our family, each adult creates ~37.6MB of data per day.

Probably about 50% of today's worldwide population has no access to create any data whatsoever. Of the remaining 50%, maybe 33% are at an age where data creation is insignificant. That leaves about 2.3B people actively creating data at around 37.6MB/day, which would account for about 86.5PB of data creation a day.
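Here's the arithmetic behind those estimates, using the population and exclusion figures above (everything else is straightforward unit conversion):

```python
# Sanity-checking the exabyte-a-day number against individual data creation.
# Population, exclusion fractions, and the 37.6MB/day figure are from the post.

WORLD_POPULATION_B = 6.8        # billions of people
EXABYTE_MB = 1e12               # 1 XB = 10**18 bytes = 1e12 MB (decimal)
MB_PER_CREATOR_PER_DAY = 37.6   # our family's per-adult average

# If everyone contributed equally, an exabyte a day implies ~150MB/person/day.
avg_mb_per_person = EXABYTE_MB / (WORLD_POPULATION_B * 1e9)

# Exclude the ~50% with no access, and the ~33% of the rest too young/old to matter.
creators_b = round(WORLD_POPULATION_B * 0.50 * (1 - 0.33), 1)   # ~2.3B active creators
individual_pb_per_day = creators_b * MB_PER_CREATOR_PER_DAY     # 1B people x 1MB/day = 1PB/day

print(f"Implied average: ~{avg_mb_per_person:.0f} MB/person/day")
print(f"Active creators: ~{creators_b}B -> ~{individual_pb_per_day:.1f} PB/day from individuals")
```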

Naturally, I would consider myself a power data creator but

  • We are not doing much with video production, which creates gobs of data.
  • Also, my wife retains camera rights and I only take the occasional photo with my cell phone, so I wouldn't say we are heavy into photography.

Nonetheless, 37.6MB/day on average seems exceptionally high, even for us.

Data creation by companies

However, that XB a day also accounts for corporate data generation as well as individuals'. Hoover's, a US corporate database, lists about 33M companies worldwide. These are probably the biggest 33M and no doubt they create lots of data each day.

Given that individuals probably account for 86.5PB/day, that leaves ~913.5PB/day for the Hoover's DB of 33M companies to create. By my calculations, that says each of these companies generates about ~27.6GB/day. No doubt there are plenty of companies out there doing this each day, but the average company generating 27.6GB a day?? I don't think so.

Ok, my count of companies could be wildly off. Perhaps the 33M companies in Hoover's DB represent only the top 20% of companies worldwide, which means maybe there are another 132M smaller companies out there, for 165M companies total. Now the 913.5PB/day says the average company generates ~5.5GB/day. This still seems high to me, especially considering it's an average across all 165M companies worldwide.
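The per-company numbers above fall out of a simple division; here's a sketch using both company counts (Hoover's 33M and the 5X guess of 165M are the post's figures):

```python
# Splitting the remaining data creation across companies.

TOTAL_PB_PER_DAY = 1000.0      # an exabyte a day
INDIVIDUAL_PB_PER_DAY = 86.5   # estimated above
corporate_pb_per_day = TOTAL_PB_PER_DAY - INDIVIDUAL_PB_PER_DAY   # ~913.5 PB/day

for companies_m in (33, 165):
    # PB/day divided by millions of companies conveniently yields GB/day/company,
    # since 1 PB = 1e6 GB.
    gb_per_company_per_day = corporate_pb_per_day / companies_m
    print(f"{companies_m}M companies -> ~{gb_per_company_per_day:.1f} GB/day each")
```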

Most analysts predict data creation is growing by over 100% per year, so that XB/day number for this year will be 2XB/day next year.

Of course I have been looking at a new HD video camera for my birthday…

Sony HDR-TG5V

The price of quality

At HPTechDay this week we had a tour of the EVA test lab, in the south building of HP’s Colorado Springs Facility. I was pretty impressed and I have seen more than my fair share of labs in my day.

Tony Green, HP's EVA Lab Manager
The fact that they have 1200 servers and 500 EVA arrays was pretty impressive, but they also happen to have about 20PB of storage across those 500 arrays. In my day, a couple dozen arrays and 100 or so servers seemed to be enough to test a storage subsystem.

Nowadays that seems to have increased by an order of magnitude. Of course, they have sold something like 70,000 EVAs over the years, and some of these 500 arrays are older subsystems used to validate problems and debug issues for the current field population.

Another picture of the EVA lab with older EVAs

They had some old Compaq equipment there, but I seem to have flubbed the picture of that equipment, so this one will have to suffice. It has both vertically and horizontally oriented drive shelves. I couldn't tell you which EVAs these were, but as they came early in the tour, I figured they were older equipment. As you got farther into the tour you moved closer to the current iterations of EVA, like an archaeological dig in reverse: instead of the most current layers/levels coming first, they came last.

I asked Tony how many FC ports he had and he said it was probably easiest to count the switch ports and double them but something in the thousands seemed reasonable.

FC switch rack with just a small selection of switch equipment

Parts of the lab, deep in its bowels, were off limits both to cameras and to bloggers. But we were talking about some of the remote replication support EVA had and how they tested it over distance. Tony said they had to ship their reel of 100 miles of FC up north (probably for some other testing), but they have a surrogate machine which can be programmed to create the proper FC delay to match any required distance.

FC delay generator box

The blue box in the adjacent picture seemed to be this magic FC delay inducer. It had interesting lights on it.

Nigel Poulton of Ruptured Monkeys and Devang Panchigar of StorageNerve Blog were also on the tour taking pictures and video. You can barely make out Devang in the picture next to Nigel. Calvin Zito from the HP StorageWorks Blog was also on the tour but isn't in any of my pictures.

Nigel and Devang (not pictured) taking videos on EVA lab tour

Throughout our tour of the lab I can say I only saw one logic analyzer, although I am sure there were plenty more in the off-limits area.

Lonely logic analyzer in EVA lab
During HPTechDay they hit on the topic of storage-server convergence and the use of commodity, X86 hardware for future storage systems. From the lack of logic analyzers I would have to concur with this analysis.

Nonetheless, I saw some hardware workstations, although this was another lonely workstation surrounded by a sea of EVAs.

Hardware workstation in the EVA lab, covered in parts and HW stuff
Believe it or not, I actually saw one stereo microscope but failed to take a picture of it. Yet another indicator of hardware's declining role, and of my inadequacies as a photographer.

One picture shows an EVA obviously undergoing some error injection test, with drives tagged as removed and being rebuilt or reborn as part of RAID testing.

Drives tagged for removal during EVA test
In my day we would save particularly “squirrelly drives” from the field and use them to verify storage subsystem error handling. I would bet anything these tagged drives had specific error injection points used to validate EVA drive error handling.

I could go on, and I have a couple more decent lab pictures, but you get the gist of the tour.

For some reason I enjoy lab tours. You can tell a lot about an organization by how its labs look and how they are staffed, organized, and set up. What HP's EVA lab tells me is that they spare no expense to ensure their product is literally bulletproof, bug proof, and works every time for their customer base. I must say I was pretty impressed.

At the end of the HPTechDay event, Greg Knieriemen of Storage Monkeys and Stephen Foskett of GestaltIT hosted an InfoSmack podcast to be broadcast next Sunday, 10/4/2009. There we talked a little more about commodity hardware versus purpose-built storage subsystem hardware; it was a brief but interesting counterpoint to the discussions earlier in the week and the evidence from our portion of the lab tour.

SSD shipments start to take off

3 rack V-Max storage subsystem from EMC

Was on an analyst call today where Bob Wambach of EMC was discussing their recent success with V-Max, the newest version of their highly successful Symmetrix storage subsystem. But what was more interesting was their announcement of having sold 1PB of enterprise flash storage on Symmetrix and almost 2PB total across all EMC product lines in 1H09. Symmetrix SSD shipments include both DMX and V-Max installs. During 1H09, EMC shipped both 146GB and 400GB SSDs, so it's hard to put a drive count on this capacity, but at 146GB, 1PB of Symmetrix SSD would be around 6.9K SSDs, and all SSD shipments would be a maximum of ~14K drives.
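A rough sketch of how those drive counts fall out of the reported capacities (the 146GB/400GB mix was not disclosed, so these are bounding cases):

```python
# Estimating SSD unit counts from the capacities EMC reported for 1H09.

PB_GB = 1e6   # 1 PB = 1,000,000 GB (decimal)

symmetrix_ssds_at_146gb = 1 * PB_GB / 146   # ~6.9K drives if all Symmetrix SSDs were 146GB
all_emc_ssds_at_146gb  = 2 * PB_GB / 146    # ~13.7K drives, the maximum unit count
all_emc_ssds_at_400gb  = 2 * PB_GB / 400    # ~5K drives, the minimum unit count

print(f"Symmetrix (1PB, all 146GB): ~{symmetrix_ssds_at_146gb:,.0f} SSDs")
print(f"All EMC (2PB): {all_emc_ssds_at_400gb:,.0f} to {all_emc_ssds_at_146gb:,.0f} SSDs")
```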

SSD drive shipments vs. hard drives

To put this in perspective, ~540M hard drives were shipped in 2008, and with a ~7% decline in 2009 this should equate to around 502M drive shipments in 2009. But this includes all drives; if 15-20% of these go into data center storage, then ~75 to ~100M data center hard drives will be shipped in '09. Looking at just the first half, probably close to 40% of the whole year, ~30-40M data center hard drives were shipped across the industry in 1H09. In Q2'09 EMC had a 22.4% storage revenue market share; using this market share for all of 1H09, they probably shipped ~7.8M data center hard drives during 1H09 (assuming revenue correlates with drive shipments). Hence, 14K SSDs represent a very small but growing proportion (<0.2%) of all drives sold by EMC.
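Here's that chain of estimates laid out, using the low and high ends of the data center share (the ~7.8M figure above corresponds to the midpoint):

```python
# Rough estimate of EMC's 1H09 data center hard drive shipments, to put 14K SSDs
# in context. All percentages are the estimates used in the text above.

DRIVES_2008 = 540e6
DECLINE_2009 = 0.07
DATA_CENTER_FRACTIONS = (0.15, 0.20)   # share of all drives going into data centers
FIRST_HALF_FRACTION = 0.40             # 1H09 as a share of the full year
EMC_REVENUE_SHARE = 0.224              # Q2'09 storage revenue share, applied to all of 1H09
EMC_SSDS_1H09 = 14e3

drives_2009 = DRIVES_2008 * (1 - DECLINE_2009)                    # ~502M drives
for frac in DATA_CENTER_FRACTIONS:
    dc_drives_1h09 = drives_2009 * frac * FIRST_HALF_FRACTION     # ~30-40M data center drives
    emc_drives_1h09 = dc_drives_1h09 * EMC_REVENUE_SHARE          # ~6.7-9.0M EMC drives
    ssd_share = EMC_SSDS_1H09 / emc_drives_1h09 * 100
    print(f"{frac:.0%} DC share -> EMC ~{emc_drives_1h09/1e6:.1f}M drives, SSDs = {ssd_share:.2f}%")
```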

Of course this is just the start

On the analyst call today EMC provided a couple of examples of recent SSD installations. In one example, a customer was looking at a US$3M mainframe upgrade but instead went with a $500K SSD upgrade. EMC examined their current storage, identified the hottest, most active LUNs, and converted these to SSDs. Doing so solved the customer's performance problems and allowed them to defer the mainframe upgrade.

Data center access patterns

Some statistics from EMC's year-long data center study, analyzing detailed IO traces from around 600 data centers, show that over 60% of data center data is infrequently accessed. EMC believes such data is best left on high-capacity SATA drives. As for the rest, it wouldn't surprise me if 15-20% is accessed frequently enough to reside on SSD/flash drives for improved performance, with the remaining 20-25% probably best served today by FC drives.

Nowadays, EMC goes through a manual analysis to identify which data to place on SSDs, but in the near future their FAST (Fully Automated Storage Tiering) software will be able to migrate data to the right-performing storage tier automatically. With FAST in place, supposedly one only needs to upgrade to SSDs and turn FAST on; after that it will analyze your data reference patterns and automatically move your performance-critical data to SSDs.

The coming SSD world

So SSDs are starting to be adopted by organizations both large and small. Perhaps current SSD drive shipments are insignificant compared to hard drives, but given today's realities of data use there seems no reason that SSD adoption can't accelerate and someday claim 10% or more of all data center drive shipments. At today's numbers, that would mean almost 10 million SSDs shipped each year.

The coming hard drive capacity wall?

A head assembly on a Seagate disk drive by Robert Scoble (cc) (from flickr)
Hard drives have been on a capacity tear lately, what with perpendicular magnetic recording and tunneling magnetoresistive heads. As evidence, Seagate just announced their latest Barracuda XT, a 2TB hard drive with 4 platters, ~500GB/platter, at a 368Gb/sqin recording density.

Read-head technology limits

Recently, I was at a Rocky Mountain IEEE Magnetics Society seminar where Bruce Gurney, Ph.D., from Hitachi Global Storage Technologies (HGST) said there was no viable read-head technology to support anything beyond 1Tb/sqin recording densities. In all fairness this was a public lecture, not under NDA, but it's obvious the (read-head) squeeze is on.

Assuming it's a relatively safe bet that densities of ~1Tb/sqin can be attained with today's technology, that means another ~3X in drive capacity is achievable using current read-heads. Hence, for a 4-platter configuration, we can expect a ~6TB drive in the near future using today's read-heads.
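The scaling arithmetic, using the Barracuda XT figures above as the baseline (the raw ratio is closer to 2.7X; rounding it to ~3X is what gives the ~6TB figure):

```python
# Scaling today's drive capacity to the ~1Tb/sqin read-head limit mentioned above.

current_capacity_tb = 2.0          # Barracuda XT, 4 platters
current_density_gb_sqin = 368.0    # Gb/sqin today
limit_density_gb_sqin = 1000.0     # ~1Tb/sqin, the cited limit for current read heads

scaling = limit_density_gb_sqin / current_density_gb_sqin   # ~2.7X, call it ~3X
future_capacity_tb = current_capacity_tb * scaling          # ~5.4TB, roughly 6TB on 4 platters

print(f"Density headroom: ~{scaling:.1f}X -> ~{future_capacity_tb:.1f}TB per 4-platter drive")
```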

But what happens next?

It's unclear to me how quickly drive vendors deliver capacity increases these days, but in the recent past a doubling in capacity has occurred every 18-24 months. Assuming this holds for today's technology, somewhere in the next 24-36 months we will see a major transition to a new read-head technology.

Thus, today's drive industry must spend major R&D dollars to discover, develop, and engineer a brand new read-head technology to take drive capacity to the next level. What this means for write heads, media smoothness, and head flying height is unknown.

Read-head development

At the seminar mentioned above, Bruce had some interesting charts showing how long previous read-head technologies took to develop and how long they were used in drive production. According to Bruce, it has recently been taking about 9 years to go from read-head technology discovery to drive production. However, while in the past a new read-head technology would last 10 years or more in production, nowadays they appear to last only 5 to 6 years before the production read-head technology changes. Thus, it takes twice as long to develop read-head technology as it's used in production, which means the R&D expense must be amortized over a shorter timeframe. If anything, from my perspective, the production runs for new read-head technologies seem to be getting shorter?!

Nonetheless, HGST and others most probably have a new head up their sleeves but are not about to reveal it until the time comes to bring it out in a new hard drive.

Elephant in the room

But that's not the problem. If production runs continue to shrink in duration and the cost and time of developing new heads don't shrink accordingly, the industry must eventually reach a drive capacity wall. This won't be because some physical magnetic/electronic/optical constraint has been reached, but because a much more fundamental, inherent economic constraint has been reached – it just costs too much to develop new read-head technology, and no one company can afford it.

There are a few ways out of this death spiral that I can see:

  • Lengthen read-head production duration,
  • Decrease the cost/time to develop new read-heads, or
  • Create an industry-wide read-head technology consortium that can track and fund future read-head development.

More than likely we will need some combination of all of these solutions if the drive industry is to survive for long.

Chart of the month: SPC-1 LRT performance results

Chart of the Month: SPC-1 LRT(tm) performance results

The above chart shows the top 12 LRT(tm) (least response time) results for the Storage Performance Council's SPC-1 benchmark. The vertical axis is the LRT in milliseconds (msec.) for the top benchmark runs. As can be seen, the two subsystems from TMS (RamSan400 and RamSan320) dominate this category with LRTs significantly less than 2.5msec. The IBM DS8300 and its turbo cousin come in next, followed by a slew of others.

The 1msec. barrier

Aside from the blistering LRTs of the TMS systems, one significant item in the chart above is that the two IBM DS8300 systems crack the <1msec. barrier using rotating media. Didn't think I would ever see the day; of course this happened 3 or more years ago. Still, it's kind of interesting that there haven't been more vendors with subsystems that can achieve this.

LRT is probably most useful for high cache hit workloads. For these workloads the data comes directly out of cache, and the only thing between a server and its data is subsystem IO overhead, measured here as LRT.

Encryption cheap and fast?

The other interesting tidbit from the chart is that the DS5300 with full drive encryption (FDE), using drives which I believe come from Seagate, cracks into the top 12 at 1.8msec, exactly equivalent to the IBM DS5300 without FDE. Now FDE from Seagate is a hardware drive encryption capability and might not be measurable at a subsystem level. Nonetheless, it shows that having data security need not reduce performance.

What is not shown in the above chart is that adding FDE to the base subsystem only costs an additional US$10K (the base DS5300 lists at US$722K and the FDE version at US$732K). Seems like a small price to pay for data security, which in this case is simply: turn it on, generate keys, and forget it.

FDE is a hard drive feature where the drive itself encrypts all data written and decrypts all data read from the drive, and it requires a subsystem-supplied drive key at power on/reset. In this way the data is never in plaintext on the drive itself. If the drive were taken out of the subsystem and attached to a drive tester, all one would see is ciphertext. Similar capabilities have been available in enterprise and SMB tape drives in the past, but to my knowledge the IBM DS5300 FDE result is the first disk storage benchmark with drive encryption.

I believe the key manager for the DS5300 FDE is integrated within the subsystem. Most shops would need a separate, standalone key manager for more extensive data security, and I believe the DS5300 can also interface with a standalone (IBM) key manager. In any event, it's still an easy and simple step towards increased data security for a data center.

The full report on the latest SPC results will be up on my website later this week but if you want to get this information earlier and receive your own copy of our newsletter – email me at SubscribeNews@SilvertonConsulting.com?Subject=Subscribe_to_Newsletter.

Why SSD performance is a mystery?

SSDs! by gimpbully (cc) (from flickr)

SSD and/or SSS (solid state storage) performance is a mystery to most end-users. The technology is inherently asymmetrical, i.e., it reads much faster than it writes. I have written on some of these topics before (STEC’s new MLC drive, Toshiba’s MLC flash, Tape V Disk V SSD V RAM) but the issue is much more complex when you put these devices behind storage subsystems or in client servers.

Some items that need to be considered when measuring SSD/SSS performance include:

  • Is this a new or used SSD?
  • What R:W ratio will we use?
  • What blocksize should be used?
  • Do we use sequential or random I/O?
  • What block inter-reference interval should be used?

This list is necessarily incomplete but it’s representative of the sort of things that should be considered to measure SSD/SSS performance.

New device or pre-conditioned

Hard drives show little performance difference whether new or pre-owned, defect skips notwithstanding. In contrast, SSDs/SSSs can perform very differently when new versus after they have been used for a short period, depending on their internal architecture. A new SSD can write without erasure throughout its entire memory address space, but sooner or later wear leveling must kick in to equalize the use of the device's NAND memory blocks. Wear leveling causes both reads and rewrites of data during its processing. Such activity takes bandwidth and controller processing away from normal IO. If you have a new device, it may take days or weeks of activity (depending on how fast you write) to attain the device's steady state, where each write causes some sort of wear leveling activity.

R:W Ratio

Historically, hard drives have had slightly slower write seeks than reads, due to the need to be more accurately positioned to write data than to read it. As such, it might take 0.5msec longer to write than to read 4K bytes. But for SSDs the problem is much more acute; e.g., read times can be in microseconds while write times can approach milliseconds for some SSDs/SSSs. This is due to the nature of NAND flash, which must erase a block before it can be programmed (written), with the programming process taking a lot longer than a read.

So the question for measuring SSD performance is what read-to-write (R:W) ratio to use. Historically, a R:W of 2:1 was used to simulate enterprise environments, but most devices are starting to see more like 1:1 for enterprise applications due to the caching and buffering provided by controllers and host memory. I can't speak as well for desktop environments, but it wouldn't surprise me to see 2:1 used to simulate desktop workloads as well.

SSDs operate a lot faster with a 1000:1 workload than with a 1:1 workload. Most SSD data sheets tout a significant read I/O rate, but only for 100% read workloads. This is like a subsystem vendor quoting a 100% read cache hit rate (which some do), which is unusual in the real world of storage.
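To illustrate why the mix matters so much, here's a small sketch computing effective IOPS for a hypothetical device with asymmetric latencies; the 100µs read and 900µs write figures are illustrative assumptions, not any vendor's spec:

```python
# Effective throughput of an SSD with asymmetric read/write latencies under
# different read:write mixes. Latencies below are illustrative assumptions only.

READ_LATENCY_US = 100.0    # e.g. ~0.1 msec per read
WRITE_LATENCY_US = 900.0   # e.g. ~0.9 msec per write (erase/program overhead)

def effective_iops(read_fraction: float) -> float:
    """Single-threaded IOPS for a given fraction of reads in the mix."""
    avg_latency_us = read_fraction * READ_LATENCY_US + (1 - read_fraction) * WRITE_LATENCY_US
    return 1e6 / avg_latency_us

for reads, writes in ((1000, 1), (2, 1), (1, 1)):
    frac = reads / (reads + writes)
    print(f"R:W {reads}:{writes} -> ~{effective_iops(frac):,.0f} IOPS")
```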

Blocksize to use

Hard drives are not insensitive to blocksize, as blocks can potentially span tracks, which requires track-to-track seeks to read or write them. However, SSDs can also have adverse interactions with varying blocksizes. This is dependent on the internal SSD architecture and is due to over-optimizing write performance.

With an SSD, you erase a block of NAND and write a page or sector of NAND at a time. As writes take much longer than reads, many SSD vendors add parallelism to improve write throughput. Parallelism writes or programs multiple sectors at the same time. Thus, if your blocksize is an integral multiple of the multi-sector size, write performance is great; if not, performance can suffer.

In all honesty, similar issues exist with hard drive sector sizes. If your blocksize is an integral multiple of the drive sector size then performance is great; if not, too bad. In contrast to SSDs, drive sector size is often configurable at the device level.
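As a rough illustration of the alignment effect, here's a sketch assuming a hypothetical 16KB parallel program unit; actual internal write units vary by SSD architecture and are rarely published:

```python
# How blocksize alignment with an SSD's parallel program unit affects writes.
# The 16KB program unit is an illustrative assumption about internal architecture.

import math

PROGRAM_UNIT_KB = 16   # sectors programmed in parallel per internal write (assumed)

def write_efficiency(blocksize_kb: int) -> float:
    """Fraction of programmed NAND capacity actually carrying host data."""
    units = math.ceil(blocksize_kb / PROGRAM_UNIT_KB)
    return blocksize_kb / (units * PROGRAM_UNIT_KB)

for bs in (4, 16, 20, 32, 50):
    print(f"{bs}KB writes -> {write_efficiency(bs):.0%} of programmed NAND holds host data")
```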

Sequential vs. random IO

Hard drives perform sequential IO much better than random IO. For SSDs this is not much of a problem, as once wear leveling kicks in, it’s all random to the NAND flash. So when comparing hard drives to SSDs the level of sequentiality is a critical parameter to control.

Cache hit rate

The block inter-reference interval simply measures how often the same block is re-referenced. This is important for caching devices and systems because it ultimately determines the cache hit rate (reading data directly from cache instead of device storage). Hard drives have onboard caches of 8 to 32MB today. SSD drives also have a DRAM cache for data buffering and other uses. SSDs typically publicize their cache size, so in order to ensure zero cache hits one needs a block inter-reference interval close to the device's capacity. Not a problem today with 146GB devices, but as they move to 300GB and larger it becomes more of a problem to completely characterize device performance.
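A crude sketch of why the re-reference footprint must dwarf the device cache to keep hit rates near zero (cache sizes here are illustrative, and uniform random re-reference is a simplifying assumption):

```python
# Rough cache hit rate for uniform random access over a working set, versus the
# device's onboard DRAM cache. The point: the working set (inter-reference
# footprint) must dwarf the cache to drive hit rates toward zero.

DEVICE_CAPACITY_GB = 146.0

def approx_hit_rate(cache_mb: float, working_set_gb: float) -> float:
    """For uniform random re-references, hit rate ~ cache size / working set size."""
    return min(1.0, cache_mb / (working_set_gb * 1024))

for cache_mb in (32, 256):
    for ws_gb in (1, 10, DEVICE_CAPACITY_GB):
        rate = approx_hit_rate(cache_mb, ws_gb)
        print(f"{cache_mb}MB cache, {ws_gb:g}GB working set -> ~{rate:.2%} hit rate")
```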

The future

So how do we get a handle on SSD performance? SNIA and others are working on a specification on how to measure SSD performance that will one day become a standard. When the standard is available we will have benchmarks and service groups that can run these benchmarks to validate SSD vendor performance claims. Until then – caveat emptor.

Of course most end users would claim that device performance is not as important as (sub)system performance which is another matter entirely…

What if there were no backup?

Data Center by Mathieu Ramage (Flickr)
If backup didn’t exist and you had to start over to protect your data how would you do it today?

I think five things are important to protect data in today's data center:

  • Any data ever created in the data center or on-the-road needs to be protected,
  • Data restores must be under end-user control,
  • Data needs to be copied/replicated/mirrored offsite to support disaster recovery,
  • Multiple data copies should exist only to satisfy some data protection policy – one copy is mandatory, two copies (not co-located) would be required to support higher availability, and
  • Data protection activities should not interfere with or interrupt ongoing data center operations

All this can be and is being done with backup and other systems today, but most of these products and features grew out of earlier phases of computing. With today's technology, many of these capabilities may no longer be necessary if one could just rethink data protection from the ground up.

Data Versioning

I think some form of data/file/block versioning could easily support the requirement of restoring any data ever created. Versioning systems have existed in the past and could certainly be reconstituted today with some sort of standards. The cost of storing all that data might be a concern, but storage costs continue to decrease, and if the multiple copies retained for data protection can be eliminated, it might just be a wash. Versioning could just as easily be provided for the laptop, and once new versions of data are created, old versions could be moved off the laptop to the data center for safekeeping and to free up space.
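As a toy illustration only (not any product's design), a content-addressed version store shows how keeping every version ever created and deduplicating identical content can go hand in hand:

```python
# A toy sketch of the kind of versioning store described above: every write
# becomes a new immutable version, addressed by content hash so identical
# content is stored only once. Purely illustrative.

import hashlib
from collections import defaultdict

class VersionStore:
    def __init__(self):
        self.blobs = {}                     # content hash -> data (deduped)
        self.versions = defaultdict(list)   # path -> ordered list of content hashes

    def write(self, path: str, data: bytes) -> str:
        digest = hashlib.sha256(data).hexdigest()
        self.blobs.setdefault(digest, data)   # identical content stored once
        self.versions[path].append(digest)
        return digest

    def read(self, path: str, version: int = -1) -> bytes:
        return self.blobs[self.versions[path][version]]

store = VersionStore()
store.write("/home/ray/notes.txt", b"first draft")
store.write("/home/ray/notes.txt", b"second draft")
print(store.read("/home/ray/notes.txt", version=0))   # any version ever created is retrievable
```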

End-user visibility

End-user restoration requires some facility for exploring the end-user's protected file-name and block space. Once this is available, identifying which version needs to be restored and where to restore it should be straightforward. All backup applications provide a backup directory, and a few even allow end-user access to perform data restores. While all this works well with files, having an end-user do this for block storage would require more sophistication. Nonetheless, both file and block restores seem entirely feasible once data versioning is in place.

Ubiquitous replication

The requirement to have data copies offsite is certainly feasible today. Replication can be done in hardware or software, synchronously, semi-synchronously, and/or asynchronously. Replication today can solve this problem, but replicating to separate data centers costs too much. Enter the storage cloud. With the storage cloud we could pay just for the data bandwidth and storage needed to support our data protection needs and no more. Old data versions could be replicated as new versions are created. Protecting data written to a new version is more problematic, but some sort of write splitter (a la CDP) could be used to create a replica of this data as well.

Policy driven

Having a policy-driven data protection system that stores only a minimal number of copies of data seems difficult to support. Yet this seems to be what incremental-only backup software and archive products support today. For other backup software, using a deduplicating VTL can come very close. Adding some policy sophistication to coordinate multiple data protection copies across multiple (potentially cloud) nodes and deduplicate all the unnecessary copies seems entirely feasible.

Operationally transparent

Not interrupting ongoing operations also seems tough to crack. Yet many storage vendors provide snapshot technologies that copy block and/or file data without interrupting operations. However, coordinating vendor snapshot technologies from some central data protection manager is an essential integration that continues to be lacking.

Can pieceparts solve the problem?

Yes, most of these features are purchasable as separate product offerings (except data versioning), but what's missing is any one product that pulls all of this together and offers one integrated solution to data protection as I have described it.

The problem, of course, is that such functionality probably belongs in the O/S or a hypervisor, but they long ago relinquished any responsibility for data protection. Aside from the anti-trust and anti-competitive nature of such a future data protection O/S offering, I only see isolated steps and no coordinated attack on today's overall data protection problem.

Backup software vendors do a great job with what they have under their control, but they can't do it all; ditto for VTL providers, CDP vendors, replication products, etc. Piecemeal solutions can only take us so far down this path, but it's all we have today and, I fear, for the foreseeable future.

Dream time over for now, gotta backup some data…