Proximal Data, server SSD caching software

I attended Storage Field Day 4 (SFD4) about a month ago now and had a chance to visit with Rory Bolt, CEO/Founder of Proximal Data, a new server side caching software solution. Last month the GreyBeards (Howard Marks and I) talked with Satyam Vaghani, Co-founder and CTO of PernixData, another server side caching solution. You can find that podcast here. But this post is about Proximal Data. These guys could use some better marketing, but when you spend 90% of your funding on engineers this is what you get.

Proximal Data doesn’t believe in agent software, because it takes a long time to deploy and could potentially disrupt IT operations when being installed. In contrast, Proximal Data installs their AutoCache solution software into the hypervisor as a VIB (vSphere Installation Bundle). There was some discussion at SFD4 on whether installing the VIB would be disruptive to customer operations. Not being a VMware expert I won’t comment on the results of the discussion, but if you want to find out more I suggest viewing the SFD4 video of Proximal Data’s presentation.

Of course, being at the hypervisor layer gives them IO activity information at the VM level, which they can use to control their caching software at VM granularity. In addition, by executing at the hypervisor layer AutoCache doesn’t require any guest OS specific functionality or hooks. Another nice thing about executing at the hypervisor level is that they can cache RDM devices.

To use AutoCache you will need one or more PCIe SSD(s) or DAS SSD(s) in your ESXi server. Once the PCIe SSD or DAS SSD is installed and you have installed/activated the AutoCache software, you will need to partition or dedicate the device to Proximal Data’s AutoCache.

AutoCache is managed as a virtual appliance with a Web server GUI. With the networking set up and the AutoCache VIB installed, you can access their operator panels via a tab in vCenter. Once the software is installed you don’t have to use their GUI ever again.

AutoCache read caching algorithms

Not every read IO for a VM being cached is brought into AutoCache’s SSD cache. They are trying to ensure that cached data will be referenced again. As such, they typically wait for two reads before the data is placed into cache.
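A "wait for two reads" admission filter can be sketched roughly like this. To be clear, the class name, the seen-once set and the naive eviction are my own invention for illustration, not Proximal Data's actual implementation:

```python
class TwoTouchCache:
    """Hypothetical sketch of a two-touch cache admission policy."""

    def __init__(self, capacity_blocks):
        self.capacity = capacity_blocks
        self.cache = {}          # block_id -> data held in the SSD cache
        self.seen_once = set()   # blocks read once, not yet admitted

    def read(self, block_id, fetch_from_disk):
        if block_id in self.cache:
            return self.cache[block_id]            # cache hit
        data = fetch_from_disk(block_id)           # miss: go to backend storage
        if block_id in self.seen_once:
            # second read of this block: now admit it into the cache
            self.seen_once.discard(block_id)
            if len(self.cache) >= self.capacity:
                self.cache.pop(next(iter(self.cache)))  # naive eviction
            self.cache[block_id] = data
        else:
            self.seen_once.add(block_id)           # first read: remember only
        return data
```

The idea is simply that a block touched once may never be touched again, so the SSD space is reserved for blocks that have already proven some reuse.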

They support two different read caching algorithms, referred to during the presentation as Algorithm A and Algorithm B. (They really need some marketing – Turbo Boost and Extreme Boost sounds better to me.) I'm not sure they ever described the differences between the two, but the fact that they have multiple caching algorithms speaks to some sophistication. They also maintain a “Ghost data list”. Ghost data is data whose metadata is still in cache, but whose actual data is no longer in cache.

When a miss occurs, they determine whether the data would have been a hit in Ghost data, in Algorithm A, or in Algorithm B had those been active on the VM. If it would have been a hit in Ghost data then, in general, you probably need more SSD caching space on this ESXi server for the VMs being cached. If it would have been a hit under Algorithm A or B, you probably should be using that algorithm for this VM’s IO.
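The miss classification logic can be pictured as a simple lookup against the ghost list and per-algorithm shadow hit sets. This is my sketch of the concept only; the return strings and parameters are made up, and Proximal Data didn't describe their actual mechanism:

```python
def classify_miss(block_id, ghost_list, shadow_a_hits, shadow_b_hits):
    """Turn a cache miss into tuning advice (illustrative sketch).

    ghost_list:    ids of blocks recently evicted (metadata retained)
    shadow_*_hits: ids that Algorithm A/B *would* have kept cached
    """
    if block_id in ghost_list:
        return "add-ssd"           # was cached recently: more cache would help
    if block_id in shadow_a_hits:
        return "use-algorithm-a"   # A would have hit: switch this VM to A
    if block_id in shadow_b_hits:
        return "use-algorithm-b"   # B would have hit: switch this VM to B
    return "cold-miss"             # no algorithm or size change would help
```

Running counters of these outcomes per VM would be enough to surface the "you could use more SSD" advice the product reportedly gives.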

Another approach AutoCache supports is called “Glimmer IO”. I liken this to sequential read-ahead: AutoCache keeps track, on a VM basis, of all the IO being performed and tries to determine if it is sequential or random. If the VM is doing sequential IO, AutoCache can start reading ahead of where the VM is currently reading. By doing so, they can stage the data in cache before the VM needs it/reads it. According to Rory there are policies which can be set on a VM basis to limit how much read-ahead is performed. I assume there are policies associated with the use of Algorithm A and B on a VM basis as well, but they didn’t go into this.
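A per-VM sequential detector with a policy-capped read-ahead might look something like the sketch below. The trigger threshold and cap are parameters I invented to illustrate where a per-VM policy would plug in; this is not Glimmer IO's actual design:

```python
class SequentialDetector:
    """Per-VM sequential stream detector with bounded read-ahead (sketch)."""

    def __init__(self, trigger=4, max_readahead=32):
        self.last_lba = None       # where the previous read ended
        self.run_length = 0        # consecutive sequential reads seen
        self.trigger = trigger     # runs this long start prefetching
        self.max_readahead = max_readahead  # policy cap, in blocks

    def on_read(self, lba, nblocks):
        """Record a read; return a (start, count) prefetch or None."""
        if self.last_lba is not None and lba == self.last_lba:
            self.run_length += 1   # picks up exactly where the last read ended
        else:
            self.run_length = 0    # random access: reset the streak
        self.last_lba = lba + nblocks
        if self.run_length >= self.trigger:
            count = min(nblocks * 4, self.max_readahead)
            return (self.last_lba, count)   # stage data ahead of the VM
        return None
```

Random IO never builds a run, so it never triggers prefetch; a steady sequential stream starts staging data after a few reads.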

AutoCache cache warmup for vMotion

The other nice thing AutoCache does is provide cache warmup on the target ESXi server when moving VMs via vMotion. This is done by registering for the vMotion API and trapping vMotion requests. Once they detect that a VM is being moved, they send the VM’s AutoCache metadata over to the target host, at which point the target system’s AutoCache can start to fill its cache from shared storage. Not a bad approach from my perspective. The amount of data that needs to be moved is minimal, and you get the AutoCache code running in the target machine to start preloading blocks that were in cache on the source host. They also mentioned that once they have copied the metadata over to the target host, they can free up (invalidate) all the space in the source host’s cache that was being held by the VM being moved.
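The elegance of the scheme is that only metadata crosses the wire; the data itself comes from the shared storage both hosts can already see. A minimal sketch of the flow, with function and variable names of my own choosing:

```python
def on_vmotion(vm_id, source_cache, target_cache, read_shared):
    """Warm the target host's cache when vm_id moves (illustrative sketch).

    source_cache: dict of block_id -> data for this VM on the source host
    read_shared:  fetches a block from the shared backend storage
    """
    metadata = list(source_cache.keys())   # just block IDs, not the data
    for block_id in metadata:
        # target host prefetches from shared storage, not over the vMotion link
        target_cache[block_id] = read_shared(block_id)
    source_cache.clear()   # invalidate/free the moved VM's source-side space
    return len(metadata)
```

The metadata transfer is tiny relative to the cached data, and the source host immediately recovers the SSD space for its remaining VMs.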

Proximal Data for Hyper-V

At SFD4, Rory mentioned that a Hyper-V version of AutoCache was coming out shortly. And although they specifically indicated that write back caching was not a great idea (in contrast to Satyam and PernixData), there was a potential for them to look at implementing this as well over time.

The product is sold through resellers, distributors and OEMs.  They claim support for any flash device although they have an approved HCL.

Current pricing is $1000 for the AutoCache software to support an SSD cache of 500GB or less. From what we see in enterprise storage systems, having a cache of 2-5% of your total backend storage is about right. (But see my VM working set inflection points and SSD caching post for another side on this.) So a 500GB SSD cache should be able to support 10-25TB of backend data if all goes well.
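The back-of-envelope behind that 10-25TB figure is just the 2-5% rule of thumb inverted:

```python
def backend_range_tb(cache_gb, low_pct=2.0, high_pct=5.0):
    """Backend TB a cache of cache_gb can cover at a low_pct..high_pct ratio."""
    cache_tb = cache_gb / 1000.0
    return (cache_tb / (high_pct / 100.0), cache_tb / (low_pct / 100.0))

low_tb, high_tb = backend_range_tb(500)   # the $1000 tier: a 500GB cache
# roughly 10TB (at a 5% ratio) up to 25TB (at a 2% ratio) of backend storage
```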


After the podcast on PernixData’s clustered, write-back caching software, Proximal Data’s offering didn’t seem as complex or as capable. But there is a place for read-only caching. The fact that they can help warm the target host’s cache for a vMotion is a great feature if you plan on doing a lot of VM movement in your shop. The fact that they have distinct support for multiple cache algorithms, understand sequential detection and have some way of telling you that you could use more SSD caching is also good in my mind.


Photo: 20-nanometer NAND flash chip, IntelFreePress’ photostream



Caching DaaD for federated data centers

Internet Splat Map by jurvetson (cc) (from flickr)

Today, I attended a webinar where Pat Gelsinger, President of Information Infrastructure at EMC discussed their concept for a new product based on the Yotta Yotta technology they acquired a few years back.  Yotta Yotta’s product was a distributed, coherent caching appliance that had FC front end ports, an Infiniband appliance internal network and both FC and WAN backend links.

What one did with Yotta Yotta nodes was place them in front of your block storage, connect them together locally via InfiniBand and remotely via a WAN technology (of your choice, at the time), and then access any data behind the appliances from any attached location. They also provided very quick transfer of bulk data between remote nodes. So, their technology allowed for very rapid data transmission over standard WAN interfaces/distances and provided a distributed cache across those very same distances to the data behind the appliances.

I like caching appliances as much as anyone, but they were prominent only in the late ’70s and early ’80s, mostly because caching was hard to do within the storage subsystems of the day, and they went away a long time ago. Nowadays, you can barely purchase a lone disk drive without a cache in it. So what’s different?

Introducing DaaD

Today we have SSDs and much cheaper processing power. I wrote about new caching appliances like DataRam‘s XcelaSAN in a Cache appliances rise from the dead post I did after last year’s SNW. But EMC’s going after a slightly broader domain – the world. The caching appliance that EMC is discussing is really intended to support distributed data access, or as I like to call it, Data-at-a-Distance (DaaD).

How can this work? Data is stored on subsystems at various locations around the world. A DaaD appliance is inserted in front of each of these and connected over the WAN. Some or all of that data is then re-configured (at block or more likely LUN level) to be accessible at distance from each DaaD data center. As each data center reads and writes data from/to their remote brethren, some portion of that data is cached locally in the DaaD appliance and the rest is only available by going to the remote site (with considerably higher latency).

This works moderately well for well behaved, read intensive workloads where 80% of the IO is to 20% of the data (most of which is cached locally). But block writes present a particularly nasty problem, as any data write has to be propagated to all cache copies before being acknowledged.
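The standard hit-ratio arithmetic shows why even a good local hit rate leaves the remote miss penalty dominant. The latency figures below are illustrative numbers of my own choosing, not anything EMC quoted:

```python
def effective_latency_us(local_hit_ratio, local_us, remote_us):
    """Average read latency as a function of the local cache hit ratio."""
    return local_hit_ratio * local_us + (1.0 - local_hit_ratio) * remote_us

# e.g. 80% local hits at 100us against cross-WAN misses at 50ms (50000us):
avg = effective_latency_us(0.80, 100, 50000)
# the 20% of remote misses contribute ~10000us of the ~10080us average
```

Even at an 80% local hit rate, average latency sits two orders of magnitude above the local-hit case, which is why the workload has to be very well behaved for DaaD to feel local.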

It’s possible write propagation could be done by invalidating the data in cache (so any subsequent read would need to re-access the data from the original site). Nevertheless, even to know which DaaD nodes have a cached copy of a particular block, one needs to maintain a directory of all globally identifiable blocks held in any DaaD cache node at every moment in time. Any such table would change often and will need to be updated very carefully – atomically, deadlock free, and with non-failable transactions – therein lies one of the technological hurdles. Doing this quickly without impacting performance is another.
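A minimal sketch of what such a global block directory with invalidate-on-write might look like. All names here are mine; neither EMC nor Yotta Yotta described their actual mechanism, and a real implementation would have to make the update atomic and distributed, which is precisely the hard part:

```python
from collections import defaultdict

class BlockDirectory:
    """Tracks which DaaD nodes hold a cached copy of each global block."""

    def __init__(self):
        self.holders = defaultdict(set)   # block_id -> node ids caching it

    def on_cache_fill(self, node, block_id):
        """A node has pulled a block into its local cache."""
        self.holders[block_id].add(node)

    def on_write(self, writer_node, block_id):
        """Invalidate every remote copy before the write is acknowledged."""
        stale = self.holders[block_id] - {writer_node}
        for node in stale:
            # a send_invalidate(node, block_id) message would cross the WAN
            # here; making that atomic and deadlock free is the hurdle
            pass
        self.holders[block_id] = {writer_node}
        return stale
```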

So, simple enough: EMC takes Yotta Yotta’s technology, updates it for today’s processors, networking, and storage, and releases it as a data center federation enabler. What can one do with a federated data center? Well, that’s another question, it involves vMotion, and it must be a subject for a future post …

VMworld and long distance vMotion

Moving a VM from one data center to another

In all the blog posts/tweets about VMworld this week I didn’t see much about long distance vMotion. At Cisco’s booth there was a presentation on how they partnered with VMware to perform vMotion across 200 (simulated) miles.

I can’t recall when I first heard about this capability, but many of us have heard about it before. What was new was that Cisco wasn’t the only one talking about it. I met with a company called NetEx whose HyperIP product was being used to perform long distance vMotion between sites over 2000 miles apart, and they had at least three customers actually running their systems doing this. Now I am sure you won’t find NetEx on VMware’s long HCL list, but what they have managed to do is impressive.

As I understand it, they have an optimized appliance (also available as a virtual [VM] appliance) that terminates the TCP session (used by vMotion) at the primary site and then transfers the data payload using their own UDP protocol over to the target appliance, which re-constitutes (?) the TCP session and sends it back up the stack as if everything were local. According to NetEx CEO Craig Gust, their product typically offers a data payload efficiency of around 90% compared to standard TCP/IP’s roughly 30%, which automatically gives them a 3X advantage (although he claimed a 6X speed or distance advantage, I can’t seem to follow the logic).
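The 3X figure falls straight out of the efficiency ratio; whatever gets them from 3X to the claimed 6X isn't explained by payload efficiency alone:

```python
def effective_mbps(link_mbps, payload_efficiency):
    """Goodput over a link given its protocol payload efficiency."""
    return link_mbps * payload_efficiency

hyperip = effective_mbps(800, 0.90)    # NetEx's claimed ~90% efficiency
plain_tcp = effective_mbps(800, 0.30)  # standard TCP/IP over the same WAN
advantage = hyperip / plain_tcp        # 0.90 / 0.30 = 3X, not 6X
```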

How all this works with vCenter, DRS and HA I can only fathom, but my guess is that long distance vMotion appears to VMware as a local vMotion. This way DRS and/or HA can control it all. How the networking is set up to support this is beyond me.

Nevertheless, all of this proves that it’s not just one highend networking company coming away with a proof of concept anymore; at least two companies exist, one of which has customers doing it today.

The Storage problem

In any event, accessing the storage at the remote site is another problem. It’s one thing to transfer server memory and state information over 10-1000 miles; it’s quite another to transfer TBs of data storage over the same distance. The Cisco team suggested some alternatives to handle the storage side of long distance vMotion:

  • Let the storage stay in the original location. This would be supported by having the VM in the remote site access the storage across a network
  • Move the storage via long distance Storage vMotion. The problem with this is that transferring TBs of data (even at 90% data payload efficiency over 800 Mb/s links) would take hours. And 800Mb/s networking isn’t cheap.
  • Replicate the storage via active-passive replication. Here the storage subsystem(s) concurrently replicate the data from the primary site to the secondary site
  • Replicate the storage via active-active replication where both the primary and secondary site replicate data to one another and any write to either location is replicated to the other
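To put a number on the Storage vMotion option above, the transfer time is simple arithmetic:

```python
def transfer_hours(data_tb, link_mbps, payload_efficiency=0.9):
    """Hours to move data_tb of storage over a link_mbps WAN link."""
    bits = data_tb * 1e12 * 8                        # decimal TB -> bits
    goodput = link_mbps * 1e6 * payload_efficiency   # usable bits/sec
    return bits / goodput / 3600.0

# a single TB over 800 Mb/s at 90% efficiency is already about 3 hours,
# and multi-TB VMs scale that linearly
```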

Now I have to admit the active-active replication, where the same LUN or file system can be replicated in both directions and updated at both locations simultaneously, seems to me like unobtainium, but I can be convinced otherwise. Nevertheless, the other approaches exist today and effectively deal with the issue, albeit with commensurate increases in expense.

The Networking problem

So now that we have the storage problem solved, what about the networking problem? When a VM is vMotioned to another ESX server it retains its IP addressing so as to retain all its current network connections. Cisco has some techniques here where they can extend the VLAN (or subnet) from the primary site to the secondary site and leave the VM with the same network IP address as at the primary site. Cisco has a couple of different ways to extend the VLAN, optimized for HA, load balancing, scalability, or protocol isolation and broadcast avoidance (all of which is described further in their white paper on the subject). Cisco did mention that their VLAN extension technology currently would not support sites more than 500 miles apart.

Presumably NetEx’s product sidesteps all this by leaving the IP addresses/TCP port at the primary site and just transferring the data to the secondary site. In any event, multiple solutions to the networking problem exist as well.

Now that long distance vMotion can be accomplished, is it a DR tool, a mobility tool, a load balancing tool, or all of the above? That will need to wait for another post.