
Read an article in ScienceDaily (Achieving greater efficiency for fast datacenter operations) today that discussed some research done at MIT CSAIL to be presented next week at NSDI’19 discussing Shenango, a new algorithm to allocate idle CPU cores to process latency sensitive transaction workloads. The paper is to be presented on February 27th. (I may update this with more details on Shenango after the paper is published)
t appears that for many web-scale applications, response time is driven mostly by tail latencies (slowest service determines web page response). For these 10K-100K server environments, they have always had to over provision CPU cores to support reducing service tail latency. This has led to 100s to 1000s of cores, mostly sitting idle (but powered on) for much of the time.
here’s been some solutions that try to better use idle cores, but their core allocation responsiveness has been in the milliseconds. With 10-100s of threads that make up web service , allocating CPU resources in milliseconds was too slow
Arachne, a core aware thread scheduler

One approach to better core allocation uses Arachne: Core Aware Thread Management, out of Stanford.
With Arachne, threads are assigned to an application and each is given a priority. Arachne attempts to schedule them in priority order across an array of cores at its disposal.
Arachne’s Core Arbiter code is what assigns application threads to cores and runs under Linux at the user level. Some of its timings seem pretty fast. In the paper cited above, Arachne was able to schedule a thread to a core in under 300nsec.
Under Arachne, there are two sets of cores, managed and unmanaged cores and applications. Unmanaged cores run normal (non-Arachne, unmanaged) applications and threads. Managed cores or applications use Arachne to assign cores.
Arachne uses a Linux construct called cpusets, a collection of cores and memory banks, to allocate resources to run application threads. Cores and memory banks move between managed and unmanaged based on applications being run. Arachne assumes that managed apps have higher priority than unmanaged apps.
That is at the start of Arachne, all cores exist in the unmanaged set. The Core Arbiter executes here as well. As applications are scheduled to run, the Arbiter grabs cpusets from unmanaged applications or a free pool and assigns them to run application threads. When the application completes the cpusets are returned to the unmanaged pool.
Arachne allocates cores based on a priority scheme with 8 levels. Highest priority managed applications/threads get cpusets first, lower priority managed application threads next, and unmanaged applications last
There’s a set of APIs that applications must use to request and free cores when no longer in use. Arachne seems pretty general purpose, and as it operates with both normal (unmanaged) Linux applications as well as (Arachne) managed applications is appealing.
Shenango core allocation

Not much technical information on Shenango was available as we published this post, but their is some information in the MIT/ScienceDaily piece and some in the Arachne paper.
It appears as if Shenango detects applications suffering from high tail latency by interfacing with the network stack and seeing if packets have been waiting to be processed. It does this every 5 usecs and if a packet has been waiting since last time, it’s considered a candidate for more cores, has tail latency problems and is congested.
IIt seems to do the same for computational processes that have been waiting for some service response. Shenango implements an IOKernel that handles core allocation to apps. Shenango IO
Shenango apps use an API to indicate when they are not processing time sensitive services and when they are. If they are not, their cores can be released to more time sensitive apps that are encountering congestion
Presumably Shenango does not execute at the user level. And it’s unclear whether it can operate with both (Linux) normal and Shanango managed applications. And it also appears to be tied tightly to the network stack. Whether any of this matters to web-scale application users/developers is subject to debate.
However, the fact that it only alters core allocations when applications are congested seems a nice feature.
~~~~
The Arachne paper said it “improved SLO MemCached by 37% and reduced tail latency by 10X” . The only metric available in the Shenango discussion was that they increased typical web-scale server CPU core allocation from 60% to 100%
f Shenango or Arachne can reduce over provisioning of CPU cores and memory, it could lead to significant energy and server savings. Especially for customers running 10K servers or more.








We attended a
The challenge that Infinidat has is how to perform as well as (or better than) an all flash array when you have hybrid flash – disk storage.
Fortunately, DRAM has a random access time of ~100 nsec which is 3 orders of magnitude better than flash. A manager I had once said everyone wants their data stored on tape but accessed out of memory.
So, for a storage systems in open system environments to average an 90% DRAM read hit cache hit rate is unheard of, and only seen for brief intervals at best, with especially well behaved applications and not under virtualization. For a customer to see an average DRAM hit rate exceeding 90%, over the course of multiple days was inconceivable,.
It all starts with writes. When data is written to Infinidat, it records a terrain map of all the other data that has been written recently. I suppose one could think of this as a 2 dimensional map, with spots on the map being the equivalent of data in the storage system that have recently been written. This map changes over time so it’s more like a movie stream of frames showing, from frame to frame all the recently written data in the system at any point in time. Of course the frame rate for this stream is the IO rate.
In any case, any system could do this sort of caching algorithm, iff they had the processing power needed, had the metadata layout which made recording the IO stream frame by frame space efficient, had the metadata indexing which would enable them to locate the last frame a record was written in AND had the IO parallelism required to do a whole lot of IO all the time to keep that DRAM cache filled with hit candidates. Did I mention that Infinidat uses a three controller storage system, unlike the rest of the industry that uses a two controller system. This gives them 50% more horse power and data paths to get data into cache.
Read an article this week in Science Daily (
Skyrmions and chiral bobbers are both considered magnetic solitons, types of magnetic structures only 10’s of nm wide, that can move around, in sort of a race track configuration.
Delay line memories
Previously, the thought was that one would encode digital data with only skyrmions and spaces. But the discovery of chiral bobbers and the fact that they can co-exist with skyrmions, means that chiral bobbers and skyrmions can be used together in a racetrack fashion to record digital data. And the fact that both can move or migrate through a material makes them ideal for racetrack storage.
Read an article this week in Science Daily (
Navion uses a state of the art, non-linear factor graph optimization algorithm to navigate in space. It doesn’t sound like DL neural net image recognition but more like a statistical/probabilistic approach to image mapping and place estimation. The chip uses image compression, two stage memory, and sparse linear solver memory to reduce image processing memory requirements from 3.5MB to less than 1MB.
The table shows comparisons of the Navion chip against other traditional navigational systems that use Xeon, ARM or FPGA chips. As far as I can tell it’s either much better or at least on a par with these other larger, more complex, power hungry systems.
Read an article (
Their
The
Courses like these should be much more widely available. It’s almost the analog to the scientific method, only for the 21st century.
Read an article today from Scientific American (
There’s a lab at ASU (Arizona State University) that chemically analyzes samples of wastewater to determine the amount of drugs that a city’s population excretes. They can provide a near real-time assessment of the proportion of drugs in city sewage and thereby, in a city’s population.
In addition, by sampling sewage at a neighborhood level, one can gain an assessment of drug problems at any sub-division of a city that’s needed.