Cloud Storage Gateways Surface

[Photo: "Who says there are no clouds today" by akakumo (cc), from Flickr]

One problem holding back general purpose cloud storage has been the lack of a "standard" way to get data in and out of the cloud. Most cloud storage providers supply a REST interface, an object file interface, or some other proprietary way into their facilities. The problem is that all of these require some code development on the part of the cloud storage customer before the interfaces can be used.
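
To make that concrete, here is a rough sketch of what a provider-specific REST interface asks of the customer. The endpoint, token scheme, and header below are hypothetical stand-ins, since each provider defines its own URLs, authentication, and error semantics:

```python
import requests  # third-party HTTP library (pip install requests)

# Hypothetical provider endpoint and bearer token -- real providers each
# define their own URL layout, signing scheme, and failure modes.
ENDPOINT = "https://storage.example-cloud.com/v1/mybucket"
TOKEN = "secret-api-token"

def put_object(name, data):
    """Upload one object via the provider's REST interface."""
    resp = requests.put(
        f"{ENDPOINT}/{name}",
        data=data,
        headers={"Authorization": f"Bearer {TOKEN}"},
    )
    resp.raise_for_status()  # retries, consistency, etc. are on the caller

with open("report.pdf", "rb") as f:
    put_object("report.pdf", f.read())
```

Every application that wants to use the cloud ends up carrying some variant of this code, once per provider.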

It would be much easier if cloud storage could just talk standard access protocols such as iSCSI, FCoE, FC, NFS, CIFS, or FTP. Then any data center could use the cloud with a NIC/HBA/CNA and simply configure the cloud storage as a bunch of LUNs or as file systems/mount points/shares. FCoE and FC might prove difficult to use because of timeouts and other QoS (quality of service) issues, but iSCSI and the file-level protocols should be able to support cloud storage access without such concerns.
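
By contrast, once a cloud (or a gateway fronting it) exports a standard NFS or CIFS share, no provider-specific code is needed at all. Assuming such a share is already mounted at /mnt/cloud (an illustrative path), cloud storage access becomes ordinary file I/O:

```python
import shutil

# Assumes an NFS/CIFS share exported by the cloud (or a gateway) is
# mounted at /mnt/cloud -- the application neither knows nor cares
# which provider actually holds the bits.
shutil.copy("report.pdf", "/mnt/cloud/reports/report.pdf")

with open("/mnt/cloud/reports/report.pdf", "rb") as f:
    data = f.read()  # plain POSIX file I/O, no REST calls, no SDK
```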

So which cloud storage providers support these protocols today? Nirvanix offers CloudNAS, which accesses their facilities via NFS, CIFS, and FTP; ParaScale supports NFS and FTP; while Amazon S3 and Rackspace CloudFiles do not appear to support any of these interfaces. There are probably other general purpose cloud storage providers I am missing here, but these will suffice for now. Wouldn't it be better if some independent vendor supplied one way to talk to all of these storage environments?

How can gateways help?

For one example, Nasuni recently emerged from stealth mode, releasing a beta version of a cloud storage gateway that supports file access to a number of providers. Currently, Nasuni supports the CIFS file protocol as a front end for Amazon S3, Iron Mountain ASP, and Nirvanix, with Rackspace CloudFiles support coming soon.

However, Nasuni is more than just a file protocol converter for cloud storage. It also supplies a data cache, file snapshot services, data compression/encryption, and other cloud storage management tools. Specifically:

  • Cloud data cache – the gateway maintains a disk cache of frequently accessed data that can be read directly without going out to cloud storage. File data is chunked by the gateway and flushed out of cache to the back-end provider. How such a disk cache is kept coherent across multiple gateway nodes was not discussed.
  • File snapshot services – the gateway supports a point-in-time copy of file data used for backup and other purposes. Snapshots are created on a time schedule and provide an incremental backup of cloud file data. Presumably these snapshot chunks are also stored in the cloud.
  • Data compression/encryption services – the gateway compresses file chunks and then encrypts them before sending them to the cloud (a rough sketch of this flow follows the list). Encryption keys can optionally be maintained by the customer or automatically maintained by the gateway.
  • Cloud storage management services – the gateway configures the cloud storage needed to define volumes, monitors cloud and network performance, and provides a single bill for all cloud storage used by the customer.
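
To make the data path concrete, below is a minimal sketch of the chunk/compress/encrypt flow described above. It is not Nasuni's implementation: the 4MB chunk size is arbitrary, and the Fernet cipher from Python's cryptography package merely stands in for whatever encryption a real gateway uses.

```python
import zlib
from cryptography.fernet import Fernet  # pip install cryptography

CHUNK_SIZE = 4 * 1024 * 1024   # arbitrary; a real gateway picks its own size
key = Fernet.generate_key()    # could be customer-held or gateway-managed
cipher = Fernet(key)

def prepare_chunks(path):
    """Split a file into fixed-size chunks, compress, then encrypt each."""
    with open(path, "rb") as f:
        while chunk := f.read(CHUNK_SIZE):
            yield cipher.encrypt(zlib.compress(chunk))

# Each resulting blob would be flushed from the gateway's cache to the
# back-end cloud provider, e.g., through its REST interface.
for i, blob in enumerate(prepare_chunks("report.pdf")):
    print(f"chunk {i}: {len(blob)} bytes ready for upload")
```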

By chunking files and caching them, data read from the cloud should be accessible much faster than with normal cloud file access. Also, by providing a form of snapshot, cloud data should be easier to back up and subsequently restore. Although Nasuni's website didn't provide much information on the snapshot service, such capabilities have been around for a long time and have proven very useful in other storage systems.
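
The read-side benefit of such a cache is easy to picture with a simple least-recently-used policy. The toy sketch below (all names hypothetical) serves hot chunks locally and only goes out to the cloud on a miss:

```python
from collections import OrderedDict

class ChunkCache:
    """Toy LRU chunk cache; a real gateway would persist this to local
    disk and somehow keep it coherent across multiple gateway nodes."""

    def __init__(self, fetch_from_cloud, capacity=1024):
        self.fetch = fetch_from_cloud    # callable: chunk_id -> bytes
        self.capacity = capacity         # max chunks held locally
        self.chunks = OrderedDict()

    def get(self, chunk_id):
        if chunk_id in self.chunks:
            self.chunks.move_to_end(chunk_id)  # mark as recently used
            return self.chunks[chunk_id]       # fast local hit
        data = self.fetch(chunk_id)            # slow path: cloud round trip
        self.chunks[chunk_id] = data
        if len(self.chunks) > self.capacity:
            self.chunks.popitem(last=False)    # evict least-recently-used
        return data
```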

Nasuni is provided as a software-only solution. Once installed and activated on your server hardware, it is billed as a service, charged on top of whatever cloud storage you use. You sign up for supported cloud storage providers through Nasuni's service portal.

How well all this works is open for discussion. We have discussed caching appliances before, both from EMC and others. Two issues emerged from those discussions: maintaining cache coherence across nodes is non-trivial, and the economics of a caching appliance are subject to some debate. However, cloud gateways are more than just caching appliances, and as a way of advancing cloud storage adoption, such gateways can only help.

Full disclosure: I currently do no business with Nasuni.

Does cloud storage need backup?

I was talking with a cloud storage vendor the other day and they made an interesting comment: cloud storage doesn't need to back up data?! They told me that they, like most cloud storage providers, replicate customer file data so that there are always at least two (or more) copies of customer data residing in the cloud at different sites, zones, or locations. But does having multiple copies of file data eliminate the need to back up data?

Most people back up data to prevent data loss from hardware/software/system failures and from finger checks – user error. Nowadays, I back up my business data to an external hard disk nightly, add some family data and back all of this up once a week to external removable media, and once a month take a full backup of all user data on my family Macs (photos, music, family stuff, etc.) to external removable media which is then saved offsite.

Over my professional career (30+ years) I have lost personal data to a hardware/software/system failure maybe a dozen times. Such events have become much rarer in recent history (thank you, drive vendors). But about once a month I screw something up and delete or overwrite a file I need to keep. Most often I restore it from the hard drive, but occasionally I use the removable media to retrieve the file.

I am probably not an exception with respect to finger checks. People make mistakes. How cloud storage providers handle restoring file data deleted through user error will be a significant determinant of service quality for most novice and all professional users.

Now, in my mind there are a few ways cloud storage providers can deal with this problem:

  • Support data backup, NDMP, or something similar which takes a copy of the data off the cloud and manages it elsewhere. This approach has worked for the IT industry for over 50 years now and still appeals to many of us.
  • Never “really” delete file data – that is, always keep replicated copies of all data ever written to the cloud. How a customer accesses such “not really deleted” data is open to debate, but suffice it to say some form of file versioning might work.
  • “Delay” file deletion – don’t delete a file when the user requests it, but rather wait until some external event, interval, or management policy kicks in to “actually” delete the file from the cloud (see the sketch after this list). Again, some form of versioning may be required to access “delay deleted” data.
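
One way to picture delayed deletion is a tombstone scheme: a delete merely marks the file's versions, a restore clears the mark, and a later policy sweep actually reclaims the space. The sketch below is illustrative only, with a fixed 30-day retention interval standing in for whatever event, interval, or policy a provider might choose:

```python
import time

RETENTION_SECS = 30 * 24 * 3600   # illustrative: hold "deleted" data 30 days

class VersionedStore:
    """Toy versioned store with delayed (tombstoned) deletion."""

    def __init__(self):
        self.versions = {}   # name -> list of [created, data, deleted_at]

    def put(self, name, data):
        self.versions.setdefault(name, []).append([time.time(), data, None])

    def delete(self, name):
        # "Delay" deletion: tombstone the versions instead of removing them.
        for v in self.versions.get(name, []):
            if v[2] is None:
                v[2] = time.time()

    def restore(self, name):
        # Undo a finger check: clear the tombstone on the newest version.
        self.versions[name][-1][2] = None

    def sweep(self):
        # Management policy kicks in: actually delete expired tombstones.
        now = time.time()
        for name, vs in self.versions.items():
            self.versions[name] = [
                v for v in vs if v[2] is None or now - v[2] < RETENTION_SECS
            ]
```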

Never deleting a file is probably the easiest solution to this problem, but the cloud storage bill would quickly grow out of control. Delaying file deletion is probably a better compromise, but deciding which event, interval, or policy should trigger “actually deleting data” to free up storage space is crucial.

Luckily, most people realize they have made a finger check fairly quickly (although they may be reluctant to admit it). So waiting a week, month, or quarter before actually deleting file data would solve this problem. Mainframers may recall generation datasets (files), where one specified the number of generations (versions) of a file and, when this limit was exceeded, the oldest version was deleted. Alternatively, a space threshold could trigger deletion of old file versions, e.g., whenever the cloud reaches 60% of capacity it starts deleting old file versions. Any or all of these could be applied to different classes of data by management policy.
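
The generation-dataset idea translates almost directly: cap the number of retained versions per file and age out the oldest when the cap is exceeded. A minimal sketch, with the three-generation limit chosen arbitrarily:

```python
MAX_GENERATIONS = 3   # arbitrary cap, like a generation dataset's limit

def add_version(history, data):
    """Append a new version, pruning the oldest once the cap is exceeded."""
    history.append(data)
    while len(history) > MAX_GENERATIONS:
        history.pop(0)   # the oldest generation is "actually" deleted
    return history

versions = []
for rev in (b"v1", b"v2", b"v3", b"v4"):
    add_version(versions, rev)
print(versions)   # [b'v2', b'v3', b'v4'] -- v1 aged out
```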

Of course, all of this is pretty much what a sophisticated backup package does today. Backup software retains old file data for a defined timeframe, typically on some other media or storage than where the data normally resides. Backup storage space/media can be reclaimed on a periodic basis, such as reusing backup media every quarter or retaining only a quarter’s worth of data in a VTL. Backup software takes the management of file versioning out of the storage vendor’s hands and places it in the backup vendor’s. In any case, many of the same policies for dealing with deleted file versions discussed above can apply.

Nonetheless, in my view cloud storage providers must do something to support restoration of deleted file data. File replication is a necessary and fine solution for hardware/software/system failures, but user error is much more likely. Not supplying some method to restore files when mistakes happen is unthinkable.