Read an article in ScienceDaily (Achieving greater efficiency for fast datacenter operations) today about research done at MIT CSAIL on Shenango, a new algorithm to allocate idle CPU cores to process latency-sensitive transaction workloads. The paper is to be presented next week, on February 27th, at NSDI’19. (I may update this with more details on Shenango after the paper is published.)
It appears that for many web-scale applications, response time is driven mostly by tail latencies (the slowest service determines web page response time). Operators of these 10K-100K server environments have always had to over-provision CPU cores to keep service tail latency down. This has led to 100s to 1000s of cores mostly sitting idle (but powered on) much of the time.
There have been some solutions that try to make better use of idle cores, but their core allocation responsiveness has been in the milliseconds. With 10s to 100s of threads making up a web service, allocating CPU resources in milliseconds was too slow.
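To see why the slowest service matters so much, here is a small back-of-the-envelope sketch. It assumes a page waits on every backend service it calls and that each call is independently "slow" with some small probability. Both the independence assumption and the function names are mine, not from the article.

```python
from typing import List


def page_latency(service_latencies_ms: List[float]) -> float:
    """A page that waits on every service is only as fast as the slowest one."""
    return max(service_latencies_ms)


def prob_slow_page(num_services: int, per_service_slow_prob: float) -> float:
    """Chance that at least one of num_services calls is slow.

    Assumes calls are independent (an illustrative simplification).
    """
    return 1.0 - (1.0 - per_service_slow_prob) ** num_services


if __name__ == "__main__":
    # Three fast services and one straggler: the straggler sets the page time.
    print(page_latency([2.0, 3.0, 2.5, 40.0]))
    # How often a page hits at least one 1-in-100 slow call.
    for n in (1, 10, 100):
        print(n, round(prob_slow_page(n, 0.01), 3))
```

Under these assumptions, a page that fans out to 100 services, each slow only 1% of the time, sees at least one slow call about 63% of the time, which is why rare slow responses end up dominating page response time.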
Arachne, a core aware thread scheduler
One approach to better core allocation uses Arachne: Core Aware Thread Management, out of Stanford.
With Arachne, threads are assigned to an application and each is given a priority. Arachne attempts to schedule them in priority order across an array of cores at its disposal.
Arachne’s Core Arbiter code is what assigns application threads to cores and runs under Linux at the user level. Some of its timings seem pretty fast. In the paper cited above, Arachne was able to schedule a thread to a core in under 300nsec.
Under Arachne, cores and applications fall into two sets: managed and unmanaged. Unmanaged cores run normal (non-Arachne) applications and threads. Managed applications use Arachne to obtain their cores.
Arachne uses a Linux construct called cpusets, a collection of cores and memory banks, to allocate resources to run application threads. Cores and memory banks move between managed and unmanaged based on applications being run. Arachne assumes that managed apps have higher priority than unmanaged apps.
That is, at the start of Arachne, all cores exist in the unmanaged set. The Core Arbiter executes here as well. As applications are scheduled to run, the Arbiter grabs cpusets from unmanaged applications or a free pool and assigns them to run application threads. When the application completes, the cpusets are returned to the unmanaged pool.
Arachne allocates cores based on a priority scheme with 8 levels. The highest priority managed applications/threads get cpusets first, lower priority managed applications/threads next, and unmanaged applications last.
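The priority scheme above can be sketched as a toy model. This is not Arachne's Core Arbiter code. The class and function names are hypothetical, and treating 0 as the highest of the 8 levels is my assumption, since the numbering isn't stated here.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

NUM_PRIORITY_LEVELS = 8  # 8 levels per the post; the numbering is an assumption


@dataclass
class CoreRequest:
    app: str           # hypothetical managed application name
    priority: int      # assumption: 0 = highest priority, 7 = lowest
    cores_wanted: int


def allocate_cores(total_cores: int,
                   requests: List[CoreRequest]) -> Tuple[Dict[str, int], int]:
    """Grant cores to managed apps in priority order.

    Returns (cores granted per managed app, cores left for unmanaged apps).
    """
    for req in requests:
        if not 0 <= req.priority < NUM_PRIORITY_LEVELS:
            raise ValueError(f"priority out of range for {req.app}")

    grants: Dict[str, int] = {}
    free = total_cores  # at the start, every core sits in the unmanaged set
    # sorted() is stable, so equal-priority requests keep their arrival order.
    for req in sorted(requests, key=lambda r: r.priority):
        granted = min(req.cores_wanted, free)
        grants[req.app] = granted
        free -= granted
    # Whatever managed apps did not claim stays with unmanaged applications.
    return grants, free


if __name__ == "__main__":
    reqs = [CoreRequest("batch", 5, 4), CoreRequest("cache", 0, 6)]
    print(allocate_cores(8, reqs))
```

In this sketch the high priority "cache" request is satisfied in full, the lower priority "batch" request gets what remains, and unmanaged applications only see cores that no managed application asked for.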
There’s a set of APIs that applications must use to request cores and to free them when they are no longer in use. Arachne seems pretty general purpose, and the fact that it operates with both normal (unmanaged) Linux applications and (Arachne) managed applications is appealing.
Shenango core allocation
Not much technical information on Shenango was available as we published this post, but there is some information in the MIT/ScienceDaily piece and some in the Arachne paper.
It appears as if Shenango detects applications suffering from high tail latency by interfacing with the network stack and seeing if packets have been waiting to be processed. It does this every 5 usecs, and if a packet has been waiting since the last check, the application is considered congested, suffering from tail latency problems, and a candidate for more cores.
It seems to do the same for computational processes that have been waiting for some service response. Shenango implements an IOKernel that handles core allocation to apps.
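The "waiting since last time" check described above can be sketched as a toy model. This is my reading of the article, not Shenango's code. The `AppQueue` class and `congested_apps` function are hypothetical names, and timestamps are plain integers in microseconds.

```python
from collections import deque
from typing import Deque, List

POLL_INTERVAL_US = 5  # polling period mentioned in the article


class AppQueue:
    """Hypothetical per-application queue of packets awaiting processing."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.pending: Deque[int] = deque()  # arrival times (usec), oldest first

    def enqueue(self, arrival_us: int) -> None:
        self.pending.append(arrival_us)

    def process_one(self) -> None:
        if self.pending:
            self.pending.popleft()


def congested_apps(apps: List[AppQueue], last_poll_us: int) -> List[str]:
    """Apps whose oldest packet was already queued at the previous poll.

    Such a packet has waited at least one full polling interval, so the
    app is treated as congested and a candidate for another core.
    """
    return [a.name for a in apps if a.pending and a.pending[0] <= last_poll_us]


if __name__ == "__main__":
    web, batch = AppQueue("web"), AppQueue("batch")
    web.enqueue(3)    # arrived before the poll at t=5 and never processed
    batch.enqueue(7)  # arrived after the poll at t=5
    # Poll at t=10: the previous poll was at t=5.
    print(congested_apps([web, batch], last_poll_us=10 - POLL_INTERVAL_US))
```

Here only "web" is flagged at the t=10 poll, since its packet was already waiting at t=5. The "batch" packet arrived in between polls and gets one more interval before it counts as congestion.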
Shenango apps use an API to indicate when they are not processing time sensitive services and when they are. If they are not, their cores can be released to more time sensitive apps that are encountering congestion.
Presumably Shenango does not execute at the user level. And it’s unclear whether it can operate with both (Linux) normal and Shenango managed applications. And it also appears to be tied tightly to the network stack. Whether any of this matters to web-scale application users/developers is subject to debate.
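A rough sketch of that hand-off follows, with everything in it assumed for illustration. The `App` fields stand in for whatever the real API reports, and moving one core per congested app per round is my simplification, not something the article states.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class App:
    name: str
    cores: int
    time_sensitive: bool     # what the app last reported (hypothetical API)
    congested: bool = False  # set by the congestion check


def rebalance(apps: List[App]) -> List[Tuple[str, str]]:
    """Move cores from non-time-sensitive apps to congested time-sensitive ones.

    Gives each congested app at most one extra core per call and returns
    the list of (donor, receiver) moves. Apps are updated in place.
    """
    donors = [a for a in apps if not a.time_sensitive and a.cores > 0]
    needy = [a for a in apps if a.time_sensitive and a.congested]
    moves: List[Tuple[str, str]] = []
    for receiver in needy:
        if not donors:
            break  # nothing left to give away this round
        donor = donors[0]
        donor.cores -= 1
        receiver.cores += 1
        moves.append((donor.name, receiver.name))
        if donor.cores == 0:
            donors.pop(0)  # this donor has no cores left to release
    return moves


if __name__ == "__main__":
    apps = [
        App("batch", 2, time_sensitive=False),
        App("web", 1, time_sensitive=True, congested=True),
        App("kv", 1, time_sensitive=True, congested=True),
        App("api", 1, time_sensitive=True, congested=False),
    ]
    print(rebalance(apps))
    print([(a.name, a.cores) for a in apps])
```

Note that "api" is time sensitive but not congested, so it neither gives up nor receives a core. Only congestion triggers a change.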
However, the fact that it only alters core allocations when applications are congested seems a nice feature.
The Arachne paper said it “improved SLO MemCached by 37% and reduced tail latency by 10X”. The only metric available in the Shenango discussion was that they increased typical web-scale server CPU core allocation from 60% to 100%.
If Shenango or Arachne can reduce over-provisioning of CPU cores and memory, it could lead to significant energy and server savings, especially for customers running 10K servers or more.