Quasar, data center scheduling reboot

Two people talking to one another in a data center hallway about one person wide with bunches of racks and cabling on either side
Microsoft Bing Maps’ datacenter by Robert Scoble

Read an article today from ZDnet called Data center scheduling as easy as watching a movie. It was about research out of Stanford University that shows how short glimpses of an application in operation can be used to determine the best existing infrastructure to run it on (for more info, see “Quasar: Resource-Efficient and QoS-Aware Cluster Management” by Christina Delimitrou and Christos Kozyrakis).

What with all the world’s compute moving to the cloud, the cloud providers are starting to see poor CPU utilization. E.g., AWS’s EC2 average server utilization is typically between 3 and 17%, Google’s is between 25 and 35% and Twitter’s is consistently below 20% (source: paper above). Such poor utilization at cloud scale is causing them to lose a lot of money.

Most cloud organizations and larger companies these days have a myriad of servers they have acquired over time. These servers often range from the latest multi-core behemoths to older servers that have seen better days.

Nonetheless, as new applications come into the mix, it’s hard to know whether they need the latest servers or could get by just as well with some older equipment that happens to be lying around idle in the shop. Because of this inability to ascertain the best infrastructure to run them on, we often end up with the over provisioning/under utilization that we see today.

A better way to manage clusters

This is the classic problem that cluster management is trying to solve. There are essentially two issues in cluster management for new applications:

  • What resources the application will need to run,
  • Which available servers can best satisfy the application’s resource requirements.

The first issue is normally answered by the application developer/deployer, who gets to specify the resources. When they get this wrong, the applications run on servers with more resources than needed, which end up being lightly utilized.

But what if there were a way to automate the first step in this process?

It turns out if you run a new application for a short time you can determine its execution characteristics. Then if you could search a database of applications currently running on your infrastructure, you could match how the new application runs with how current applications run and determine a pseudo-optimum fit for the best place to run the new application.

Such a system would need to monitor current applications and determine their server resource usage, e.g., memory use, IO activity, CPU utilization, etc. in your shop. The system would need to construct and maintain a database mapping applications to server resource utilization. Also, somewhere you would need a database of current server resources in your cluster.

But if you have all that in place, it seems like you could have a solution to the classic cluster management problem presented above.
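
The matching idea can be sketched in a few lines. This is only an illustration under my own assumptions: the profile fields (CPU, memory, IO), the sample numbers and all the function names are hypothetical, and a plain nearest-neighbor lookup stands in for whatever classification technique a real cluster manager would use.

```python
import math

# Hypothetical database: normalized (cpu, memory, io) profiles of applications
# already running in the cluster, plus the server class each one runs well on.
KNOWN_APPS = {
    "web-frontend": {"profile": (0.20, 0.30, 0.10), "server_class": "older-2-core"},
    "analytics-batch": {"profile": (0.90, 0.70, 0.40), "server_class": "new-multi-core"},
    "key-value-cache": {"profile": (0.30, 0.90, 0.20), "server_class": "large-memory"},
}


def distance(a, b):
    """Euclidean distance between two (cpu, memory, io) profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def closest_known_app(new_profile):
    """Return the name of the known application whose profile is nearest."""
    return min(
        KNOWN_APPS,
        key=lambda name: distance(KNOWN_APPS[name]["profile"], new_profile),
    )


def suggest_server_class(new_profile):
    """Suggest a server class by copying the choice made for the nearest app."""
    return KNOWN_APPS[closest_known_app(new_profile)]["server_class"]


# Profile measured during a short trial run of a new application (made-up numbers).
trial_profile = (0.85, 0.65, 0.35)
print(closest_known_app(trial_profile), suggest_server_class(trial_profile))
```

The point of the sketch is just that a short trial run plus a lookup against known applications can replace a developer’s guess about resource needs.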

What about performance critical apps

There’s a class of applications that have stringent QoS requirements that go beyond optimal runtime execution characteristics (latency/throughput sensitive workloads). These applications must be run in environments that can guarantee their latency requirements can be met. This may not be the optimal location from a cluster perspective but it may be the only place the application can run and still meet its service objectives.

So any cluster management optimization would also need to factor such application QoS requirements into its decision matrix on where to run new applications.

Quasar cluster management

The researchers at Stanford have implemented the Quasar cluster management solution which does all that. Today it provides

  1. A way for users to specify application QoS requirements for those applications that require special services,
  2. It runs new applications briefly to ascertain their resource requirements and quickly classifies their characteristics against a database of currently running applications, and
  3. It allocates new applications to the optimal server configurations that are available.
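
The three capabilities above can be sketched as a toy placement routine. To be clear, this is not Quasar’s algorithm: the `Server` and `AppRequest` types, their fields and the greedy “smallest server that fits” rule are all assumptions of mine, used only to show how a QoS requirement can narrow the set of candidate servers.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Server:
    name: str
    free_cores: int
    free_memory_gb: int
    latency_ms: float  # typical request latency this server class delivers


@dataclass
class AppRequest:
    name: str
    cores: int
    memory_gb: int
    max_latency_ms: Optional[float] = None  # None means no QoS requirement


def place(app: AppRequest, servers: List[Server]) -> Optional[Server]:
    """Pick the smallest server that fits the app and meets its QoS, if any."""
    candidates = [
        s for s in servers
        if s.free_cores >= app.cores
        and s.free_memory_gb >= app.memory_gb
        and (app.max_latency_ms is None or s.latency_ms <= app.max_latency_ms)
    ]
    if not candidates:
        return None
    # Prefer the server with the least spare capacity, leaving big servers free.
    best = min(candidates, key=lambda s: (s.free_cores, s.free_memory_gb))
    best.free_cores -= app.cores
    best.free_memory_gb -= app.memory_gb
    return best


cluster = [Server("old-1", 4, 16, 20.0), Server("new-1", 32, 256, 2.0)]
print(place(AppRequest("batch-job", 2, 8), cluster).name)
print(place(AppRequest("latency-critical", 2, 8, max_latency_ms=5.0), cluster).name)
```

In this made-up cluster the batch job lands on the older server, while the latency-critical application is pushed to the newer one even though the older server still has room.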

 

The paper cited above shows results from using Quasar cluster management on Hadoop clusters, memcached and Cassandra clusters, HotCRP clusters as well as a cloud environment. For the cloud environment, Quasar has shown that it can boost server utilization for a 200 node cloud environment running 1200 workloads up to 65%.

The paper goes into more detail and there’s more information on Quasar available on Christina Delimitrou’s website.

~~~

Comments?

IBM’s next generation, TrueNorth neuromorphic chip

Ok, I admit it, besides being a storage nut I also have an enduring interest in AI. And as the technology of more sophisticated neuromorphic chips starts to emerge it seems to me to herald a whole new class of AI capabilities coming online. I suppose it’s both a bit frightening as well as exciting which is why it interests me so.

IBM announced a new version of their neuromorphic chip line, called TrueNorth, with 5B+ transistors and the equivalent of ~1M neurons. There were a number of articles on this yesterday but the one I found most interesting was in MIT Technology Review, IBM’s new brainlike chip processes data the way your brain does (based on an article in the journal Science, login required, A million spiking neuron integrated circuit with a scaleable communications network and interface). We discussed an earlier generation of their SyNAPSE chip in a previous post (see my IBM research introduces SyNAPSE chip post).

How does TrueNorth compare to the previous chip?

The previous generation SyNAPSE chip had a multi-mode approach which used 65K “learning synapses” together with ~256K “programming synapses”. Their current generation TrueNorth chip has 256M “configurable synapses” and 1M “programmable spiking neurons”. So the current chip has quadrupled the previous chip’s “programmable synapses” and multiplied the “configurable synapses” by a factor of 1000.

Not sure why the configurable synapses went up so high but it could be an aspect of connectivity, something akin to what happens in a “complete graph”, which has a direct edge between every pair of nodes in the graph. In a complete graph with N nodes the number of edges is [N*(N-1)]/2, which for 1M nodes would be ~500 billion edges. So with 256M synapses it is nowhere near a complete graph. Instead it works out to 256 synapses per neuron on average.
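
Working the complete graph formula through for the TrueNorth figures quoted in this post gives a sense of scale. A small sketch (the function and variable names are mine):

```python
def complete_graph_edges(n):
    """Number of undirected edges in a complete graph with n nodes."""
    return n * (n - 1) // 2


neurons = 1_000_000      # programmable spiking neurons quoted for TrueNorth
synapses = 256_000_000   # configurable synapses quoted for TrueNorth

edges = complete_graph_edges(neurons)
print(edges)                # edges a complete graph over 1M nodes would need
print(synapses // neurons)  # average synapses per neuron
print(synapses / edges)     # fraction of full connectivity actually present
```

A complete graph over 1M neurons would need roughly 500 billion edges, so 256M synapses amounts to 256 per neuron, a tiny fraction of full connectivity.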

Analog vs. Digital?

When last I talked with IBM about their earlier version chip I wondered why they used digital logic to create it rather than analog. They said that digital was the way to go in order to better follow the technology curve of normal chip electronics.

It seemed to me at the time that if you really wanted to simulate a brain’s neural processing then you would want to use an analog approach, and this should use much less power. I wrote a couple of posts on the subject, one of which was on MIT’s analog neuromorphic chip (see my MIT builds analog neuromorphic chip post) and the other was on why analog made more sense than digital technology for neuromorphic computation (see my Analog neural simulation or Digital neuromorphic computing vs. AI post).

The funny thing is that IBM’s TrueNorth chip uses a lot less power (1000X, milliwatts vs watts) than normal CMOS chips in use today. Not sure why this would be the case with digital logic but if this is true maybe there’s more of a potential to utilize these sorts of chips in wider applications beyond just traditional AI domains.

How do you program it?

I would really like to get a deeper look at the specs for TrueNorth and its programming model. There was a conference last year where IBM presented three technical papers on TrueNorth architecture and programming capabilities (see MIT Technology Review: IBM scientists show blueprints for brain like computing).

Apparently the 1M programmable spiking neurons are organized into blocks of 256 neurons each (with a prodigious amount of “configurable” synapses as well). These seem equivalent to what I would call a computational unit. One programs these blocks with “corelets” which map out the neural activity that the 256-neuron blocks can perform. Also these corelet “programs” can be linked together, or one can be subsumed within another, sort of like subroutines. IBM as of last year had a library of 150 corelets which do stuff like detect visual artifacts, motion in a visual image, detect color, etc.

Scale-out neuromorphic chips?

The abstract of the Journal Science paper talked specifically about a communications network interface that allows the TrueNorth chips to be “tiled in two dimensions” to some arbitrary size. So it is apparent that with the TrueNorth design, IBM has somehow extended a within chip block interface that allows corelets to call one another, to go off chip as well. With this capability they have created a scale-out model with the TrueNorth chip.

It’s unclear why they felt it had to go only two dimensional rather than three, but it seems to mimic the sort of cortex layer connections we have in our brains today. But even with only two dimensional scaling there are all sorts of interesting topologies that are possible.

There doesn’t appear to be any theoretical limit to the number of chips that can be connected in this fashion but I would suppose they would all need to be on a single board or at least “close” together because there’s some sort of time frame that couldn’t be exceeded for propagation delay, i.e., the time it takes for a spike to traverse from one chip to the farthest chip in the chain couldn’t exceed, say, 10msec or so.

So how close are we to brain level computations?

In one of my previous posts I reported Wikipedia stating that a typical brain has 86B neurons with between 100M and 500M synapses. I was able to find the 86B number reference today but couldn’t find the 100M to 500M synapses quote again. However, if these numbers are close to the truth, the ratio of synapses to neurons is much less in a human brain than in the TrueNorth chip. And TrueNorth would need about 86,000 chips connected together to match the neuronal computation of a human brain.

I suppose the excess synapses in the TrueNorth chip are due to the fact that electronic connections have to be fixed in place for a neuron to neuron connection to exist. Whereas in the brain, we can always grow synapse connections as needed. Also, I read somewhere (can’t remember where) that a human brain at birth has a lot more synapse connections than an adult brain and that part of the learning process that goes on during early life is to trim excess synapses down to something that is more manageable or at least needed.

So to conclude, we (or at least IBM) seem to be making good strides in coming up with a neuromorphic computational model and physical hardware, but we are still six or seven generations away from a human brain’s capabilities (assuming 1000 of these chips could be connected together into one “brain”). If a neuromorphic chip generation takes ~2 years then we should be getting pretty close to human levels of computation by 2028 or so.
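
The back-of-the-envelope numbers in this section can be reproduced with a few lines of arithmetic. The sketch below uses the figures quoted in this post as-is and adds two assumptions of my own: that neuron count per chip doubles with each generation, and that the clock starts in 2014, the year of the TrueNorth announcement.

```python
import math

brain_neurons = 86_000_000_000  # human brain neuron count quoted above
chip_neurons = 1_000_000        # TrueNorth neurons per chip
chips_per_brain = 1000          # chips assumed connectable into one "brain"
years_per_generation = 2        # ~2 years per chip generation
start_year = 2014               # assumption: year of the TrueNorth announcement

chips_needed = brain_neurons // chip_neurons
shortfall = chips_needed / chips_per_brain

# Assumption: each generation doubles the neurons per chip.
generations = math.log2(shortfall)
target_year = start_year + math.ceil(generations) * years_per_generation

print(chips_needed)           # chips needed to match 86B neurons today
print(round(generations, 1))  # doublings needed to close the gap
print(target_year)
```

With these inputs the gap is a factor of 86, or between six and seven doublings, which is where the 2028 estimate comes from.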

The Tech Review article said that the 5B transistors on TrueNorth are more transistors than any other chip that IBM has produced. So they seem to be at current technology capabilities with this chip design (which is probably proof that their selection of digital logic was a wise decision).

Let’s just hope it doesn’t take it 18 years of programming/education to attain college level understanding…

Comments?

Photo Credit(s): New 20x [view of mouse cortex] by Robert Cudmore

Vacuum tubes on silicon

Read an interesting article the other day about researchers at NASA having invented a vacuum tube on a chip (see ExtremeTech, Vacuum tube strikes back). Their report was based on an IEEE Spectrum article called Introducing the Vacuum Transistor.

Computers started out early in the last century being mechanical devices (card sorters), moved up to electronic sorters/calculators/computers with vacuum tubes and eventually transitioned to solid state devices with the silicon transistor. Since then the MOS and CMOS transistor have pretty much ruled the world of electronic devices.

Vacuum tube?

Vacuum tubes had a number of problems, not the least of which were power consumption, size and reliability. It was nothing for a vacuum tube to burn out every couple of times it was powered on, and the ENIAC (panel pictured here) had over 17,000 of them, took over 200 sq meters of space, used a lot of power (150KW) and weighed 27 metric tons.

Of course each vacuum tube was the equivalent of just one transistor and the latest generation Intel Quad Core processors have over 2B transistors in them. So implementing an Intel Quad Core processor with vacuum tubes might take over 3,000 football fields of space and over 17GW for power/cooling.
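
The football field and gigawatt figures come from simply scaling ENIAC up by transistor count. A quick sketch of that arithmetic, using the ENIAC numbers quoted above; the football field area is my assumption (an American field including end zones, about 5,351 sq meters), so the field count shifts with whichever field size you pick.

```python
eniac_tubes = 17_000
eniac_area_m2 = 200
eniac_power_kw = 150
cpu_transistors = 2_000_000_000
football_field_m2 = 5_351  # assumption: American football field incl. end zones

# One tube per transistor, so count how many ENIACs' worth of tubes are needed.
eniacs = cpu_transistors / eniac_tubes
area_m2 = eniacs * eniac_area_m2
fields = area_m2 / football_field_m2
power_gw = eniacs * eniac_power_kw / 1_000_000  # kW -> GW

print(round(eniacs))       # ENIAC-equivalents
print(round(fields))       # football fields of floor space
print(round(power_gw, 1))  # gigawatts
```

That is roughly 118,000 ENIACs, well over 3,000 football fields and a bit over 17GW.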

There were plenty of niceties with vacuum tubes, not the least of which were their nice ruler flat frequency response, their ability to support much higher frequencies, and the fact that they were significantly less prone to noise and had fewer problems with radiation than transistors. This last item meant that vacuum tubes were less susceptible to electromagnetic pulses. Many modern musical/instrument amplifiers are still made today using vacuum tube technology due to their perceived better sound.

But the main problems were their size and power consumption. If you could only shrink a vacuum tube to the size of a MOS field effect transistor (FET) and correspondingly reduce its power consumption, then you would have something.

NASA shrinks the vacuum tube

NASA researchers have shrunk the vacuum tube to nanometer dimensions in a vacuum-channel transistor. They believe it can be fabricated on standard CMOS technology lines and that it can operate at 460GHz.

This new vacuum-channel transistor marries the benefits of vacuum tubes to the fabrication advantages of MOSFET technology. Making them as small as MOSFET transistors eliminates all of the problems with vacuum tube technology and handily solves a serious problem or two with MOSFETs.


One problem with MOSFET technology today is that we can no longer speed it up any faster than 4-5GHz. This limit was reached in 2004 when Intel and others determined that clock speed couldn’t be sped up much more without serious problems resulting and, as a result, they started using additional transistors to offer multi-core processor chips. A lot of time and money is continuing to be spent on seeing how best to offer even more cores but in the end there’s only so much parallelism that can be achieved in most applications and this limits the speed ups that can be attained with multi-core architectures.

But a shrunken vacuum tube doesn’t seem to have the same issues with higher clock speeds.  Also, there is a serious reduction in power consumption that accrues along with reduction in size.

The vacuum in a vacuum tube was there to inhibit electrons from being interfered with by gases. With the vacuum-channel transistor they don’t think they need a vacuum anymore due to the reduction in size and power being used, but there’s a little problem of how to create a helium filled enclosure, which they feel will work instead of a vacuum. NASA feels that with today’s chip packaging this shouldn’t be a problem.

Also, their current prototypes use 10V but other researchers have reduced other vacuum-channel transistors to use only 1-2V. As of yet the NASA researchers haven’t fabricated their vacuum-channel transistors on a real CMOS line, but that’s the next major hurdle.

Imagine a much faster IT

A 400GHz processor in your desktop and maybe a 200GHz processor in your phone/tablet could all be possible with vacuum-channel transistors. They would be so much faster than today’s multi-core systems that it would be almost impossible to compare the two. Yes, there are some apps where multi-core could speed things up considerably but something that’s 10X faster than today’s processors would operate much faster than a 10 core CPU. And it still doesn’t mean you couldn’t have multi-core vacuum-channel systems as well.
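
The claim that a 10X faster processor beats a 10 core CPU follows from Amdahl’s law, which caps multi-core speedup by the serial portion of an application. A minimal sketch, with parallel fractions picked purely for illustration:

```python
def amdahl_speedup(parallel_fraction, cores):
    """Amdahl's law: speedup when only the parallel part is spread over cores."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


# Illustrative parallel fractions; real applications vary widely.
for p in (0.5, 0.9, 0.99):
    print(p, round(amdahl_speedup(p, 10), 2))
```

Even an application that is 90% parallel gets only a little over 5X from 10 cores, whereas a clock that is 10X faster speeds up the serial and parallel parts alike (ignoring memory and IO limits).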

SSD or NAND flash storage is essentially based on CMOS transistors and the speed of flash is somewhat of a function of the speed of its transistors. A 400GHz vacuum-channel transistor could speed up flash storage by an order of magnitude or more. Flash access times are already at the 7µsec level (see my posts on MCS and UltraDIMM storage here and here). How much of that 7µsec access time is due to the memory channel and how much is a function of the SanDisk SSD storage is an open question. But whatever portion is on the SSD side could be potentially reduced by a factor of 10 or more with the use of vacuum-channel transistors.

From a disk perspective there are myriad issues that affect how much data can be stored linearly on a disk platter. But one of them is the switching speed of the electromagnetic (GMR) head and the electronics. Vacuum-channel transistors should be able to eliminate that issue at least in the electronics and maybe with some work in the head as well, so disk densities would no longer have to worry about switching speeds. Similar issues apply to magnetic tape densities as well.

Unclear to me how faster switching time would impact network transmission speeds. But it seems apparent that optical transmission times have already reached some sort of limit based on light frequencies used for transmission. However, electronic networking transfer speeds may be able to be enhanced significantly with faster speed switching.

Naturally, WIFI and other forms of radio transmission are seriously impeded by the current frequency and power of electronic switching. That’s one of the reasons why radio stations still depend somewhat on vacuum tubes. However, with vacuum-channel transistors problems with switching speed go away.  Indeed, NASA researchers believe that their vacuum-channel transistors should be able to reach terahertz (1000GHz) transmission switching. Which might make WIFI almost faster than any direct connect networking today.

~~~~
Comments?

Photo Credit(s): ENIAC panel (rear) by Erik Pittit, The Vacuum Tube Transistor from IEEE Spectrum

Thinly provisioned compute clouds

Thin provisioning has been around in storage since StorageTek’s Iceberg hit the enterprise market in 1995.  However, thin provisioning has never taken off for system servers or virtual machines (VMs).

But recently a paper out of MIT, Making cloud computing more efficient, discussed some research that came up with the idea of monitoring system activity to model and predict application performance.

So how does this enable thinly provisioned VMs?

With a model like this in place, one could conceivably provide a thinly provisioned virtual server that could guarantee a QoS and still minimize resource consumption. For example, have the application VM consume just the resources needed at any instant in time, which could be adjusted as demands on the system change. Thus, as an application’s needs grow, more resources could be supplied and as needs shrink, resources could be given up for other uses.

With this sort of server QoS, certain classes of application VMs would need to have variable or no QoS so that their resources could be sacrificed in times of need to those VMs that required guaranteed QoS. But in a cloud service environment a multiplicity of service classes like these could be supplied at different price points.

Thin provisioning grew up in storage because it’s relatively straightforward for a storage subsystem to understand capacity demands at any instant in time.  A storage system only needs to monitor data write activity and if a data block was written or consumed then it would be backed by real storage. If it had never been written, then it was relatively easy to fabricate a block of zeros if it ever was read.
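
The allocate-on-write behavior described above fits in a small toy class. This is an illustrative sketch only; the class name, block size and method names are my own, and a real storage subsystem would track allocation in far more elaborate ways.

```python
class ThinVolume:
    """Toy thinly provisioned volume: blocks get real storage only when written."""

    def __init__(self, virtual_blocks, block_size=4096):
        self.virtual_blocks = virtual_blocks
        self.block_size = block_size
        self._backing = {}  # block number -> bytes, only for written blocks

    def _check(self, block_no):
        if not 0 <= block_no < self.virtual_blocks:
            raise IndexError("block outside the volume")

    def write(self, block_no, data):
        """Back the block with real storage the first time it is written."""
        self._check(block_no)
        if len(data) != self.block_size:
            raise ValueError("data must be exactly one block")
        self._backing[block_no] = bytes(data)

    def read(self, block_no):
        """Return stored data, or fabricate zeros for a never-written block."""
        self._check(block_no)
        return self._backing.get(block_no, bytes(self.block_size))

    def allocated_bytes(self):
        """Physical capacity actually consumed by this volume."""
        return len(self._backing) * self.block_size


volume = ThinVolume(virtual_blocks=1_000_000)  # about 4GB virtual capacity
volume.write(42, b"x" * 4096)
print(volume.allocated_bytes())  # only one block is physically backed
```

The volume advertises a million blocks but consumes physical capacity only for the one block that was written.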

Prior to thinly provisioned storage, fat provisioning required that storage be configured to the maximum capacity required of it. Similarly, fully (or fat) provisioned VMs must be configured for peak workloads. With the advent of thin provisioning on storage, wasted resources (capacity in the case of storage) could be shared across multiple thinly provisioned volumes (LUNs), thereby freeing up these resources for other users.

Problems with server thin provisioning

I see some potential problems with the model and my assumptions as to how thinly provisioned VMs would work. First, the modeled performance is a lagging indicator at best. Just as system transactions start to get slower, a hypervisor would need to interrupt the VM to add more physical (or virtual) resources. Naturally during the interruption system performance would suffer.

It would be helpful if resources could be added to a VM dynamically, in real time without impacting the applications running in the VM. But it seems to me that adding physical or virtual CPU cores,  memory, bandwidth, etc., to a VM would require at least some sort of interruption to a pair of VMs [the one giving up the resource(s) and the one gaining the freed up resource(s)].

Similar issues occur for thinly provisioned storage. As storage is consumed for a thinly provisioned volume, allocating more physical capacity takes some amount of storage subsystem resources and time to accomplish.

How does the model work?

It appears that the software model works by predicting system performance based on a limited set of measurements. Indeed, their model is bi-modal. That is, there are two approaches:

  • Black box model – tracks server or VM indicators such as “number and type of user requests” as well as system performance and uses AI to correlate the two. This works well for moderate fluctuations in demand but doesn’t help when requests for services fall beyond those boundaries.
  • Grey box model – is more sophisticated and is based on an understanding of a specific database functionality, such as how frequently they flush host buffers, commit transactions to disk logs, etc.  In this case, they are able to predict system performance when demand peaks at 4X to 400X current system requirements.
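
To make the black box idea concrete, here is a minimal sketch that correlates request rate with CPU use. A simple least-squares line stands in for whatever AI technique the researchers actually use, and the measurements are made up for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical observations: user requests per second vs. CPU utilization (%).
requests_per_sec = [100, 200, 300, 400]
cpu_percent = [12.0, 22.0, 32.0, 42.0]

slope, intercept = fit_line(requests_per_sec, cpu_percent)


def predict_cpu(rps):
    """Predict CPU utilization for a request rate from the fitted line."""
    return slope * rps + intercept


print(round(predict_cpu(500), 1))   # modest extrapolation, still plausible
print(round(predict_cpu(4000), 1))  # far outside observed range, nonsense
```

The second prediction comes out at over 400% CPU, which shows why a purely correlational model breaks down once demand moves well beyond what it has observed.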

They have implemented the grey box model for MySQL and are in the process of doing the same for Postgres.

Model validation and availability

They tested their prediction algorithm against published TPC-C benchmark results and were able to come within 80% accuracy for CPU use and 99% accuracy for disk bandwidth consumption.

It appears that the team has released their code as open source. At least one database vendor, Teradata, is porting it over to their own database machine to better allocate physical resources to data warehouse queries.

It seems to me that this would be a natural for cloud compute providers and even more important for hypervisor solutions such as vSphere, Hyper-V, etc.  Anyplace one could use more flexibility in assigning virtual or physical resources to an application or server would find use for this performance modeling.

~~~~

Now, if they could just do something to help create thinly provisioned highways, …

Image: Intel Team Inside Facebook Data Center By IntelFreePress

IBM boosts System z processing speed

At this week’s Hot Chips Conference Brian Curran, IBM Distinguished Engineer, discussed their recently announced, new faster processing chip for System z mainframe environments that runs at 5.2GHz. (FYI, the first 31 minutes of the YouTube video link above are from Brian’s session and the first 10 minutes provide a good overview of the chip.)

Brian discussed System z environments, which mainly run large mission critical applications such as OLTP that use large instruction and data caches. Also, System z is now being used for Linux consolidation with 1000s of Linux machines running on a mainframe.

The numbers

The new z196 processing core provides up to a 40% improvement executing mainframe applications.  Also, the new processor chip was measured at 50 Billion instructions per second (Bips).

In addition, the z196 achieved a remarkable 40% code thread constant improvement and another 20-30% throughput performance improvement was attainable through re-compilation.  Moreover, they have shown a sustained system execution throughput (multi-thread/multi-application) of 400 Bips.  All this was done without increasing energy consumption over current generation System z processing chips.

Cache everywhere and lots of it

The z196 chip is a 45nm, 1.4B transistor, quad core processor with two onboard, special purpose co-processors for cryptographic and compression acceleration. Each core has a 64KB private L1 I-cache (instruction), a 128KB private L1 D-cache (data) and a 1.5MB private L2 cache. These L1 and L2 SRAM caches are replicated for each of the four cores. There is an onboard shared 24MB eDRAM L3 cache as well. All cores in the z196 quad-core processor run at the full 5.2GHz clock speed.

Each z196 processing core supports out-of-order instruction execution with a 40 instruction window size.   Further, all data is protected with ECC and hardened with parity and/or duplication for processing steps.

Six of these z196 processing chips combine together to form a processor node on a multi-chip module (MCM). There is an industry first additional 192MB eDRAM L4 cache shared across the six processing chips on an MCM. Each System z MCM can interface with up to 750GB of main memory.

In a System z processing frame there can be up to four MCMs, which then provides a total of 96 processing cores.  With the four MCMs, System z can address ~3TB of main memory.  Each MCM is fully interconnected with all other MCMs in a processing frame via a pair of redundant fabric interfaces.
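
The frame-level totals follow directly from the per-chip and per-MCM figures quoted above. A quick sketch of the arithmetic (the variable names are mine):

```python
cores_per_chip = 4
chips_per_mcm = 6
mcms_per_frame = 4
memory_per_mcm_gb = 750
l3_per_chip_mb = 24
l4_per_mcm_mb = 192

total_chips = chips_per_mcm * mcms_per_frame
total_cores = cores_per_chip * total_chips
total_memory_gb = memory_per_mcm_gb * mcms_per_frame
total_l3_mb = l3_per_chip_mb * total_chips
total_l4_mb = l4_per_mcm_mb * mcms_per_frame

print(total_cores)      # processing cores in a full frame
print(total_memory_gb)  # GB of addressable main memory, i.e. ~3TB
print(total_l3_mb, total_l4_mb)  # MB of shared L3 and L4 cache across the frame
```

So a fully populated frame has 96 cores and about 3TB of main memory, plus 576MB of L3 and 768MB of L4 cache in aggregate.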

System z is a CISC architecture which, with the z196, has passed the 1000 instruction count barrier (1079 instructions). Whew, glad I am not coding in Assembler anymore.

IBM formally announced the chip a month ago and it will be in shipping System z product later this year.

There was some mention by WSJ blogs of Power 7+ systems going up to 5.5GHz but I couldn’t locate a more definitive source for that news.

Comments?

Image: Z10 by Roberto Berlim