Thinly provisioned compute clouds

Thin provisioning has been around in storage since StorageTek’s Iceberg hit the enterprise market in 1995.  However, thin provisioning has never taken off for system servers or virtual machines (VMs).

But recently a paper out of MIT, “Making cloud computing more efficient,” discusses research that monitors system activity in order to model and predict application performance.

So how does this enable thinly provisioned VMs?

With a model like this in place, one could conceivably provide a thinly provisioned virtual server that guarantees a QoS and still minimizes resource consumption.  For example, the application VM would consume only the resources it needs at any instant in time, and that allocation could be adjusted as demands on the system change.  Thus, as an application's needs grow, more resources could be supplied, and as its needs shrink, resources could be given up for other uses.

With this sort of server QoS, some classes of application VMs would need to have variable or no QoS, so that in times of need their resources could be sacrificed to VMs that require guaranteed QoS. In a cloud service environment, a multiplicity of service classes like these could be offered at different price points.
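
To make the service class idea concrete, here is a minimal sketch of how a scheduler might hand out CPU cores by QoS class when total demand exceeds capacity. The class names, the `VM` record and the `allocate` function are all hypothetical, invented for illustration; they are not taken from the MIT work or from any hypervisor.

```python
from dataclasses import dataclass

# Lower number = served first. These three classes are illustrative assumptions.
PRIORITY = {"guaranteed": 0, "variable": 1, "none": 2}


@dataclass
class VM:
    name: str
    qos: str     # one of the PRIORITY keys
    demand: int  # CPU cores the VM needs right now


def allocate(vms, capacity):
    """Grant cores in priority order; lower classes get whatever is left."""
    grants = {}
    remaining = capacity
    for vm in sorted(vms, key=lambda v: PRIORITY[v.qos]):
        granted = min(vm.demand, remaining)
        grants[vm.name] = granted
        remaining -= granted
    return grants


vms = [
    VM("batch", "none", 6),
    VM("web", "variable", 4),
    VM("db", "guaranteed", 8),
]
grants = allocate(vms, capacity=14)
print(grants)
```

With 14 cores available and 18 requested, the guaranteed VM gets its full 8 cores, the variable VM gets its 4, and the VM with no QoS is squeezed down to the 2 cores left over.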

Thin provisioning grew up in storage because it’s relatively straightforward for a storage subsystem to understand capacity demands at any instant in time.  A storage system only needs to monitor data write activity: if a data block has been written, it is backed by real storage. If it has never been written, it’s relatively easy to fabricate a block of zeros if it is ever read.
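
The allocate-on-write behavior described above can be sketched in a few lines. This is a toy model only, with a hypothetical `ThinVolume` class and an assumed 4 KB block size; real storage subsystems track allocation with far more elaborate metadata.

```python
BLOCK_SIZE = 4096  # bytes per block; an illustrative value


class ThinVolume:
    """A toy thin-provisioned volume: real blocks are allocated on first write."""

    def __init__(self, virtual_blocks):
        self.virtual_blocks = virtual_blocks  # capacity promised to the host
        self._backing = {}                    # only written blocks live here

    def write(self, block, data):
        self._check(block)
        if len(data) != BLOCK_SIZE:
            raise ValueError("data must be exactly one block")
        self._backing[block] = data  # first write allocates real storage

    def read(self, block):
        self._check(block)
        # Never-written blocks are fabricated as zeros; no storage is consumed.
        return self._backing.get(block, bytes(BLOCK_SIZE))

    def physical_blocks(self):
        return len(self._backing)

    def _check(self, block):
        if not 0 <= block < self.virtual_blocks:
            raise IndexError("block out of range")


vol = ThinVolume(virtual_blocks=1_000_000)
vol.write(42, b"\x01" * BLOCK_SIZE)
print(vol.physical_blocks(), "of", vol.virtual_blocks, "blocks are backed")
```

The host sees a million-block volume, but only the one block that was written consumes physical capacity.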

Prior to thinly provisioned storage, fat provisioning required that storage be configured to the maximum capacity required of it. Similarly, fully (or fat) provisioned VMs must be configured for peak workloads. With the advent of thin provisioning on storage, wasted resources (capacity, in the case of storage) could be shared across multiple thinly provisioned volumes (LUNs), thereby freeing up these resources for other users.

Problems with server thin provisioning

I see some potential problems with the model and with my assumptions as to how thinly provisioned VMs would work. First, the modeled performance is a lagging indicator at best.  Just as system transactions start to get slower, a hypervisor would need to interrupt the VM to add more physical (or virtual) resources.  Naturally, during the interruption, system performance would suffer.

It would be helpful if resources could be added to a VM dynamically, in real time without impacting the applications running in the VM. But it seems to me that adding physical or virtual CPU cores,  memory, bandwidth, etc., to a VM would require at least some sort of interruption to a pair of VMs [the one giving up the resource(s) and the one gaining the freed up resource(s)].

Similar issues occur for thinly provisioned storage. As storage is consumed for a thinly provisioned volume, allocating more physical capacity takes some amount of storage subsystem resources and time to accomplish.

How does the model work?

It appears that the software model works by predicting system performance based on a limited set of measurements. Indeed, their model is bi-modal. That is, there are two approaches:

  • Black box model – tracks server or VM indicators such as “number and type of user requests” as well as system performance and uses AI to correlate the two. This works well for moderate fluctuations in demand but doesn’t help when requests for services fall outside those boundaries.
  • Grey box model – is more sophisticated and is based on an understanding of a specific database’s functionality, such as how frequently it flushes host buffers, commits transactions to disk logs, etc.  In this case, they are able to predict system performance when demand peaks at 4X to 400X current system requirements.
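
To illustrate the black box idea, here is a minimal sketch that correlates request rate with CPU use. It swaps in a plain least-squares line fit for whatever AI technique the researchers actually use, and the measurements are made-up sample values, not data from their study.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical observations: request rate versus measured CPU utilization.
requests_per_sec = [100, 200, 300, 400]
cpu_percent = [12.0, 22.0, 32.0, 42.0]

slope, intercept = fit_line(requests_per_sec, cpu_percent)

# Predict CPU use at a request rate inside the observed range.
predicted = slope * 250 + intercept
print(f"predicted CPU at 250 req/s: {predicted:.1f}%")
```

A fit like this is only trustworthy inside the range of demand it was trained on (100 to 400 requests per second here), which is the same limitation noted for the black box model above.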

They have implemented the grey box model for MySQL and are in the process of doing the same for PostgreSQL.

Model validation and availability

They tested their prediction algorithm against published TPC-C benchmark results and were able to come within 80% accuracy for CPU use and 99% accuracy for disk bandwidth consumption.

It appears that the team has released their code as open source. At least one database vendor, Teradata, is porting it over to their own database machine to better allocate physical resources to data warehouse queries.

It seems to me that this would be a natural for cloud compute providers and even more important for hypervisor solutions such as vSphere, Hyper-V, etc.  Anyplace that could use more flexibility in assigning virtual or physical resources to an application or server would find a use for this performance modeling.

~~~~

Now, if they could just do something to help create thinly provisioned highways, …

Image: Intel Team Inside Facebook Data Center By IntelFreePress