
Read an article today from ZDNet called "Data center scheduling as easy as watching a movie." It was about research out of Stanford University showing how short glimpses of applications in operation can be used to determine the best existing infrastructure to run them on (for more info, see "Quasar: Resource-Efficient and QoS-Aware Cluster Management" by Christina Delimitrou and Christos Kozyrakis).
What with all the world’s compute moving to the cloud, the cloud providers are starting to see poor CPU utilization. E.g., AWS’s EC2 average server utilization is typically between 3 and 17%, Google’s is between 25 and 35%, and Twitter’s is consistently below 20% (source: the paper above). Such poor utilization at cloud scale costs them a lot of money.
Most cloud organizations and larger companies these days have a myriad of servers they have acquired over time. These servers often range from the latest multi-core behemoths to older machines that have seen better days.
Nonetheless, as new applications come into the mix, it’s hard to know whether they need the latest servers or could get by just as well with some older equipment that happens to be lying around idle in the shop. This inability to ascertain the best infrastructure to run them on leads to the over-provisioning/under-utilization we see today.
A better way to manage clusters
This is the classic problem that cluster management tries to solve. There are essentially two issues in cluster management for new applications:
- What resources the application will need to run, and
- Which available servers can best satisfy the application’s resource requirements.
The first issue is normally answered by the application developer/deployer, who gets to specify the resources. When they get this wrong, applications run on servers with more resources than needed, which end up being lightly utilized.
But what if there were a way to automate the first step in this process?
It turns out that if you run a new application for a short time you can determine its execution characteristics. Then, if you could search a database of applications currently running on your infrastructure, you could match how the new application runs against how current applications run and determine a pseudo-optimal fit for the best place to run the new application.
Such a system would need to monitor the applications currently running in your shop and determine their server resource usage, e.g., memory use, IO activity, CPU utilization, etc. The system would need to construct and maintain a database mapping applications to server resource utilization. You would also need a database of the server resources currently in your cluster.
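To make that concrete, here’s a minimal, hypothetical sketch of the matching step: profile a new application briefly, then reuse the placement of the most similar application already in your database. The profiles, metrics, and server classes are made up for illustration; this is not how Quasar actually implements its classification.

```python
# Hypothetical sketch: match a briefly-profiled new application against
# profiles of applications already running in the cluster.
from math import sqrt

# Resource profile: normalized (0..1) usage observed during a short run.
known_profiles = {
    "web-frontend":    {"cpu": 0.30, "mem": 0.20, "io": 0.10, "net": 0.70},
    "batch-analytics": {"cpu": 0.90, "mem": 0.60, "io": 0.80, "net": 0.20},
    "kv-cache":        {"cpu": 0.25, "mem": 0.85, "io": 0.05, "net": 0.60},
}

# Which server class each known application currently runs well on.
placement = {
    "web-frontend": "small-older-node",
    "batch-analytics": "large-multicore-node",
    "kv-cache": "high-memory-node",
}

def distance(a, b):
    """Euclidean distance between two resource profiles."""
    return sqrt(sum((a[k] - b[k]) ** 2 for k in a))

def suggest_placement(new_profile):
    """Find the most similar known application and reuse its placement."""
    best_app = min(known_profiles,
                   key=lambda app: distance(new_profile, known_profiles[app]))
    return best_app, placement[best_app]

# A short profiling run of the new application produced this signature.
new_app = {"cpu": 0.28, "mem": 0.80, "io": 0.08, "net": 0.55}
print(suggest_placement(new_app))   # -> ('kv-cache', 'high-memory-node')
```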
But if you have all that in place, it seems like you could have a solution to the classic cluster management problem presented above.
What about performance-critical apps?
There’s a class of applications with stringent QoS requirements that go beyond optimal runtime execution characteristics (latency/throughput-sensitive workloads). These applications must run in environments that can guarantee their latency requirements will be met. Such a placement may not be optimal from a cluster perspective, but it may be the only place the application can run and still meet its service objectives.
So any cluster management optimization would also need to factor such application QoS requirements into its decision matrix for where to run new applications.
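As a rough illustration of what that decision matrix might look like, here’s a hedged sketch that filters candidate servers by a latency SLO before picking the most cost-efficient one. The server names, latency estimates, and cost figures are invented for the example; a real system would measure or model them.

```python
# Hypothetical sketch: fold a latency SLO into the placement decision.
candidate_servers = [
    {"name": "old-node-07", "est_p99_latency_ms": 45, "free_cores": 12, "power_cost": 1.0},
    {"name": "new-node-02", "est_p99_latency_ms": 8,  "free_cores": 4,  "power_cost": 2.5},
    {"name": "new-node-05", "est_p99_latency_ms": 9,  "free_cores": 16, "power_cost": 2.5},
]

def place_with_qos(servers, needed_cores, latency_slo_ms=None):
    """Pick the cheapest server with enough free cores that also meets
    the application's latency SLO (if one was specified)."""
    feasible = [s for s in servers if s["free_cores"] >= needed_cores]
    if latency_slo_ms is not None:
        feasible = [s for s in feasible if s["est_p99_latency_ms"] <= latency_slo_ms]
    if not feasible:
        return None  # nothing meets the requirements; queue or scale out
    return min(feasible, key=lambda s: s["power_cost"])

# A latency-sensitive service: only the newer nodes qualify.
print(place_with_qos(candidate_servers, needed_cores=4, latency_slo_ms=10))
# A throughput-oriented batch job: the older, cheaper node is fine.
print(place_with_qos(candidate_servers, needed_cores=8))
```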
Quasar cluster management
The researchers at Stanford have implemented the Quasar cluster management solution, which does all that. Today it provides:
- A way for users to specify QoS requirements for those applications that require special services,
- A way to run new applications briefly to ascertain their resource requirements and classify their characteristics against a database of currently running applications, and
- Allocation of new applications to the optimal server configurations that are available.
The paper cited above shows results from using Quasar cluster management on Hadoop clusters, memcached and Cassandra clusters, and HotCRP clusters, as well as a cloud environment. In the cloud environment, Quasar boosted server utilization for 200 nodes running 1,200 workloads to up to 65%.
The paper goes into more detail, and there’s more information on Quasar available on Christina Delimitrou’s website.
~~~
Comments?