Know Fortran, optimize NASA code, make money

Read a number of articles this past week about NASA offering a Fortran optimization contest, the High Performance Fast Computing Challenge (HPFCC), for their computational fluid dynamics (CFD) program. They want to speed up CFD by 10X to 1000X and are willing to pay for it.

The contest is being run through HeroX and TopCoder, and they are offering $55K across the various levels of the contest to the winners.

The FUN3D CFD code (manual) runs on NASA’s Pleiades Linux supercomputer complex, which sports over 245K cores. Even running on the supercomputer complex, a typical FUN3D CFD run takes thousands to millions of core hours!

The program(s)

FUN3D does a hypersonic fluid analysis over a (fixed) surface which includes a “simulation of mixtures of thermally perfect gases in thermo-chemical equilibrium and non-equilibrium. The routines in PHYSICS_DEPS enable coupling of the new gas modules to the existing FUN3D infrastructure. These algorithms also address challenges in simulation of shocks and boundary layers on tetrahedral grids in hypersonic flows.”

Not sure what all that means, but I am certain there are a number of iterations over multiple Fortran modules, and that it does this over a 3D grid of points corresponding to both the surface being modeled and the gas mixture it’s flying through at hypersonic speeds. Sounds easy enough.

The contest(s)

There are two levels to the contest: an Ideation phase (at HeroX) and an Architecture phase (at TopCoder). The $55K is split up as follows:

  • HeroX Ideation phase, $20K total: $10K for the winner and two $5K runner-up prizes.
  • TopCoder Architecture phase, $35K total: $15K for the winner, $10K for 2nd place and another $10K for a “Qualified improvement candidate”.

The (HeroX) Ideation phase looks for specific new or faster algorithms that could replace current ones in FUN3D, including “exploiting algorithmic developments in such areas as grid adaptation, higher-order methods and efficient solution techniques for high performance computing hardware.”

The (TopCoder) Architecture phase looks specifically at speeding up actual FUN3D code execution. “Ideal submission(s) may include algorithm optimization of the existing code base, Inter-node dispatch optimization or a combination of the two. Unlike the Ideation challenge, which is highly strategic, this challenge focuses on measurable improvements of the existing FUN3d suite and is highly tactical.”

Sounds to me like the Ideation phase is selecting algorithm designs and the Architecture phase is implementing the new algorithms, or just generally speeding up FUN3D code execution.

The equation(s)

There’s a Navier-Stokes equation solver that gets called maybe a trillion times during a run, until the flow settles down, and any minor improvement here would obviously be significant. Perhaps there are algorithmic changes to be found if you’re an aeronautical engineer, or perhaps there are compiler speedups to be found if you’re a Fortran expert. Both approaches can be validated/debugged/proved out on a desktop computer.

You have to be a US citizen to access the code, and you can apply here. You will receive an email to verify your email address, and then, once you’re validated and back on the website, you need to approve the software use agreement. NASA will verify your physical address by sending you a letter with a passcode to use to finally access the code. The process may take weeks to complete, so if you’re interested in the contest, best to start now.

The Fortran(s)

I learned Fortran 66 a long time ago and may have dabbled with Fortran 77, but that’s the last time I touched Fortran. Still, it’s like riding a bike: once you do it, it’s easy to do it again.

As I understand it, FUN3D uses Fortran 2003, and NASA suggests you use the GNU Fortran (gfortran) compiler, as the Intel one has some bugs in it. There’s also a Fortran 2015 standard on the way, but it’s not in mainstream use just yet.

A million core hours, just amazing. If you could save a millisecond out of the routine called a trillion times, you’d save 1 billion seconds, or ~280K core hours.
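For anyone who wants to check that arithmetic, here’s a quick back-of-the-envelope in Python:

```python
# Back-of-the-envelope: shave 1 millisecond off a routine
# that gets called roughly a trillion times during a run.
calls_per_run = 1e12          # ~a trillion invocations
savings_per_call = 1e-3       # 1 millisecond, in seconds

saved_seconds = calls_per_run * savings_per_call   # 1e9 seconds
saved_core_hours = saved_seconds / 3600            # ~278,000 core hours

print(f"{saved_seconds:.0f} seconds saved, ~{saved_core_hours:,.0f} core hours")
```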

Coders start your engines…

 

Quantum computing at our doorsteps

Read an article the other day in MIT’s Technology Review, Google’s new chip is a stepping stone to quantum computing…, about Google’s latest endeavor to create quantum computers. Although digital logic, or classical electronic computation, has been around since the middle of last century, quantum logic does things differently, and there are many problems that are easier to compute with quantum computing but take much longer to solve with digital computing.

Qubits are weird

Classical or digital electronic computation follows the more physical, mechanistic view of the world (for the most part), while quantum computing follows the quantum mechanical view of the world. Quantum computing uses quantum bits, or qubits, and the device that Google demonstrated has a 2X3 matrix of qubits, 6 in total.

Unlike a bit, which (theoretically) is a two-state system that can only take on the values 0 and 1, a qubit is a two-level system that in reality can take on infinitely many different states. In practice, with a qubit there are always two states that are distinguishable from one another, but they can be any two of the infinitely many states it can take on.

Also, reading out the state value of a qubit is a probabilistic endeavor, and the readout can impact the “value” of the qubit afterwards.

There’s more to quantum computing and I am certainly no expert. So if you’re interested, I suggest starting with this Arxiv article.

Faster quantum algorithms

In any case, some difficult and time-consuming arenas of classical computation seem to be easier and faster with quantum computation. For example:

  • Factoring large numbers – in classical computation this process takes an amount of time that is roughly exponential in the number of bits of the “large number”: where “B” is the number of bits and “E” (epsilon) is a constant >0, simple classical algorithms take O([1+E]**B) time (the best known classical algorithm, the number field sieve, is sub-exponential but still super-polynomial). But Shor’s quantum factorization algorithm takes only O(B**3) time, which is considerably faster for large numbers. This is important because RSA cryptography and most key exchange algorithms in use today base their security on the difficulty of factoring large numbers. (See the Wikipedia article on Integer Factorization for more information.)
  • Searching an unstructured list – in classical computation, for a list of N items it takes O(N) time. But Grover’s quantum search algorithm only takes O(sqrt[N]) time, which is considerably faster for large lists; see the rough sketch after this list. (See this Arxiv paper for more information.)
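To get a feel for how those costs diverge, here’s a rough Python sketch. The base used for the classical factoring estimate (1.1, i.e., E assumed to be 0.1) is an arbitrary stand-in chosen purely for illustration; only the shapes of the curves matter.

```python
import math

# Rough growth comparison; constants and real-world details are ignored.
for B in (512, 1024, 2048):              # number sizes in bits
    classical = 1.1 ** B                 # stand-in for O([1+E]**B), E assumed 0.1
    shor = B ** 3                        # Shor's O(B**3)
    print(f"factoring a {B}-bit number: classical ~{classical:.1e} vs Shor ~{shor:.1e}")

for N in (10**6, 10**9, 10**12):         # unstructured list sizes
    print(f"searching {N:.0e} items: classical ~{N:.1e} vs Grover ~{math.sqrt(N):.1e}")
```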

Using Shor’s factorization algorithm, researchers have already been able to factor the number 15 with 7 qubits.

There are many quantum algorithms available today (see the Quantum Algorithm Zoo at NIST) with more showing up all the time.  Suffice it to say that quantum computing will be a more time efficient and thus, more effective approach to certain problems than classical computing.

Quantum computers starting to scale

Now back to the chip. According to the article, the new Google chip implements a 2X3 matrix of qubits.

For those old enough to remember, a 3-bit number was called an octal digit, ranging from 0 to 7, and two octal digits can range from 0..63. Octals were used for a long time to represent digital information on some (mostly mini-computer) systems. This is in contrast to most computing nowadays, which uses hexadecimal digits, or 4-bit numbers ranging from 0..15, with two hexadecimal digits ranging from 0..255.
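For the bit twiddlers, the ranges are easy to verify in Python:

```python
# Three bits -> one octal digit (0..7); six bits -> two octal digits (0..63).
# Four bits -> one hex digit (0..15); eight bits -> two hex digits (0..255).
print(0o77)          # 63, the largest two-octal-digit value
print(0xFF)          # 255, the largest two-hex-digit value
print(format(45, 'o'), format(45, 'x'))   # 45 decimal is 55 octal, 2d hex
```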

Why are octals important? Well, if quantum computing can scale up to multiple octal digits, then it can start representing really large numbers. According to the article, Google chose the 2X3 qubit structure because it’s easier to scale.

I assume all the piping surrounding the chip package in the above photo are cooling ports. It seems that quantum computing only works at very cold temperatures. And if this is a two-octal computer, scaling it up to multiple octals is going to take lots of space.

How quickly will it scale?

For some history, Intel introduced their 4004 (4-bit) computing chip in 1971 (Wikipedia), their 8-bit Intel 8008 in 1972 (Wikipedia), and their 16-bit Intel 8086 between 1976-78. So in about 7 years we went from a 4-bit computer to a 16-bit computer whose (x86) architecture continues on today and rules the world.

Now the Intel 4004 had 16 4-bit registers, a data/instruction bus that could address 4,096 4-bit words and a 3-level subroutine stack; it was a full-fledged 4-bit computer. It’s unclear what’s in Google’s chip. But if we imagine a 2×3-qubit computer with multiple 2×3-qubit registers, a qubit storage bus, a multi-level qubit subroutine (register) stack, etc., then we are well on our way to quantum computing being added to the world’s computational capabilities in less than 10 years.

And of course, Google’s not the only large organization working on quantum computing.

~~~~

So there you have it: Google and others are in the process of making your cryptography obsolete, rapidly speeding up unstructured searches and doing many other computations a lot faster than is possible today.

Photo Credit(s): from the MIT Technical Review article.

 

Testing filesystems for CPU core scalability

I attended the HotStorage’16 and Usenix ATC’16 conferences this past week, and there was a paper presented at ATC titled “Understanding Manycore Scalability of File Systems” (see p. 71 in the PDF) by Changwoo Min and others at Georgia Institute of Technology. This team of researchers set out to understand the bottlenecks in typical file systems as they scale from 1 to 80 (or more) CPU cores on the same server.

FxMark, a new scalability benchmark

They created a new benchmark to probe CPU core scalability, which they called FxMark (source code available at FxMark), consisting of 19 “micro benchmarks” stressing specific scalability scenarios and three application-level benchmarks representing popular file system activities.

The application benchmarks in FxMark included a standard mail server (Exim), a NoSQL DB (RocksDB) and a standard user file server (DBENCH).

In the micro benchmarks, they stressed 7 different components of file systems: 1) path name resolution; 2) page cache for buffered IO; 3) inode management; 4) disk block management; 5) file offset to disk block mapping; 6) directory management; and 7) the consistency guarantee mechanism.
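FxMark itself is written in C with its own harness, but the idea behind a scalability micro benchmark is simple to sketch. The following is not FxMark, just a toy illustration of the pattern: spawn N worker processes that each hammer one file system operation (here, file create/delete in private directories) and watch whether aggregate throughput keeps climbing as N grows.

```python
import multiprocessing as mp
import os, tempfile, time

def create_files(dirpath, n_files):
    """Worker: create and then remove n_files empty files in its own directory."""
    for i in range(n_files):
        path = os.path.join(dirpath, f"f{i}")
        open(path, "w").close()
        os.unlink(path)

def run(n_procs, n_files=2000):
    """Return aggregate file creates/sec with n_procs concurrent workers."""
    with tempfile.TemporaryDirectory() as tmp:
        dirs = []
        for p in range(n_procs):
            d = os.path.join(tmp, f"p{p}")
            os.mkdir(d)
            dirs.append(d)
        start = time.time()
        procs = [mp.Process(target=create_files, args=(d, n_files)) for d in dirs]
        for p in procs: p.start()
        for p in procs: p.join()
        return (n_procs * n_files) / (time.time() - start)

if __name__ == "__main__":
    for n in (1, 2, 4, 8):
        print(f"{n:2d} processes: {run(n):,.0f} file creates/sec")
```

If the file system (or the kernel paths underneath it) serializes on a shared lock, the creates/sec number flattens or even drops as the process count rises, which is exactly the kind of behavior the paper catalogs.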

TPU and hardware vs. software innovation (round 3)

At the Google I/O conference this week, Google revealed (see Google supercharges machine learning tasks …) that they had been designing and operating their own processor chips in order to optimize machine learning.

They call the new chip a Tensor Processing Unit (TPU). According to Google, the TPU provides an order of magnitude more power-efficient machine learning than what’s achievable via off-the-shelf GPUs/CPUs. (TensorFlow is Google’s open-sourced machine learning software.)

This is very interesting, as Google and the rest of the hyperscale hive seem to have latched onto open source software and commodity hardware for all their innovation. This has led the industry to believe that hardware customization/innovation is dead and the only thing anyone needs is software developers. I believe this is incorrect, and that hardware innovation combined with software innovation is a better way (see my Commodity hardware always loses and Better storage through hardware posts).

Quasar, data center scheduling reboot

[Photo: Microsoft Bing Maps’ datacenter, by Robert Scoble]

Read an article today from ZDnet called Data center scheduling as easy as watching a movie. It’s about research out of Stanford University showing how short glimpses of applications in operation can be used to determine the best existing infrastructure to run them on (for more info, see “Quasar: Resource-Efficient and QoS-Aware Cluster Management” by Christina Delimitrou and Christos Kozyrakis).

With all the world’s compute moving to the cloud, cloud providers are starting to see poor CPU utilization. E.g., AWS’s EC2 average server utilization is typically between 3 and 17%, Google’s is between 25 and 35%, and Twitter’s is consistently below 20% (source: the paper above). Such poor utilization at cloud scale causes them to lose a lot of money.

Most cloud organizations and larger companies these days have a myriad of servers they have acquired over time. These servers often range from the latest multi-core behemoths to older servers that have seen better days.

Nonetheless, as new applications come into the mix, it’s hard to know whether they need the latest servers or could get by just as well with some older equipment that happens to be lying around idle in the shop. This inability to ascertain the best infrastructure to run them on often leads to the over-provisioning/under-utilization we see today.

A better way to manage clusters

This is the classic problem that cluster management tries to solve. There are essentially two issues in cluster management for new applications:

  • What resources the application will need to run, and
  • Which available servers can best satisfy the application’s resource requirements.

The first issue is normally answered by the application developer/deployer, who gets to specify the resources. When they get this wrong, applications run on servers with more resources than needed, which end up being lightly utilized.

But what if there were a way to automate the first step in this process?

It turns out that if you run a new application for a short time, you can determine its execution characteristics. Then, if you could search a database of applications currently running on your infrastructure, you could match how the new application runs against how current applications run and determine a pseudo-optimal fit for the best place to run the new application.

Such a system would need to monitor the current applications in your shop and determine their server resource usage, e.g., memory use, IO activity, CPU utilization, etc. The system would need to construct and maintain a database of application-to-server resource utilization. Also, somewhere you would need a database of the current server resources in your cluster.

But if you have all that in place, it seems like you could have a solution to the classic cluster management problem presented above.
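A minimal sketch of that matching-and-placement idea, assuming a made-up three-number resource fingerprint (CPU utilization, memory GB, IOPS). This is a simple nearest-neighbor stand-in rather than Quasar’s actual collaborative-filtering classifier, and every application, server and number below is invented for illustration.

```python
import math

# Hypothetical fingerprints: (cpu_util, mem_gb, iops) measured during a short
# profiling run of each application. Neither features nor values come from Quasar.
known_apps = {
    "web-frontend":    (0.30,  4,  500),
    "batch-analytics": (0.85, 32, 2000),
    "kv-cache":        (0.20, 16, 8000),
}

servers = {            # server name -> free (cpu_fraction, mem_gb, iops)
    "old-2socket":  (0.5,  24,  3000),
    "new-manycore": (0.9, 128, 20000),
}

def cosine(a, b):
    """Cosine similarity between two resource fingerprints."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

def classify_and_place(fingerprint):
    # 1. Classify: which known application does the new one most resemble?
    match = max(known_apps, key=lambda name: cosine(known_apps[name], fingerprint))
    need = known_apps[match]
    # 2. Place: among servers with enough free resources for the matched profile,
    #    pick the one with the least free resources (a rough best-fit).
    fits = [s for s, free in servers.items() if all(f >= n for f, n in zip(free, need))]
    return match, (min(fits, key=lambda s: servers[s]) if fits else None)

# A short profiling run of a new application produced this (invented) fingerprint:
print(classify_and_place((0.25, 14, 7500)))   # -> ('kv-cache', 'new-manycore')
```

The real Quasar work does the classification with collaborative filtering against sampled profiling data and folds QoS constraints into the placement step, but the profile-then-match-then-place flow is the same basic idea.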

What about performance critical apps?

There’s a class of applications that have stringent QoS requirements which go beyond optimal runtime execution characteristics (latency/throughput sensitive workloads). These applications must be run in environments that can guarantee their latency requirements are met. This may not be the most optimal location from a cluster perspective, but it may be the only place they can run and meet their service objectives.

So any cluster management optimization would also need to factor in such application QoS requirements into its decision matrix on where to run new applications.

Quasar cluster management

The researchers at Stanford have implemented the Quasar cluster management solution, which does all that. Today it provides:

  1. A way for users to specify application QoS requirements for those applications that require special services;
  2. A way to run new applications briefly in order to ascertain their resource requirements and quickly classify their characteristics against a database of currently running applications; and
  3. Allocation of new applications to the optimal server configurations that are available.

 

The paper cited above shows results from using Quasar cluster management on Hadoop clusters, memcached and Cassandra clusters, and HotCRP clusters, as well as in a cloud environment. For the cloud environment, Quasar has been shown to boost server utilization up to 65% on a 200-node cloud running 1,200 workloads.

The paper goes into more detail and there’s more information on Quasar available on Christina Delimitrou’s website.

~~~

Comments?

IBM’s next generation, TrueNorth neuromorphic chip

Ok, I admit it, besides being a storage nut I also have an enduring interest in AI. And as the technology of more sophisticated neuromorphic chips starts to emerge it seems to me to herald a whole new class of AI capabilities coming online. I suppose it’s both a bit frightening as well as exciting which is why it interests me so.

IBM announced a new version of their neuromorphic chip line, called TrueNorth, with over 5B transistors and the equivalent of ~1M neurons. There were a number of articles on this yesterday, but the one I found most interesting was in MIT Technology Review, IBM’s new brainlike chip processes data the way your brain does (based on a Science journal article that requires a login, A million spiking-neuron integrated circuit with a scalable communication network and interface). We discussed an earlier generation of their SyNAPSE chip in a previous post (see my IBM research introduces SyNAPSE chip post).

How does TrueNorth compare to the previous chip?

The previous generation SyNAPSE chip had a multi-mode approach which used 65K “learning synapses” together with ~256K “programming synapses”. The current generation TrueNorth chip has 256M “configurable synapses” and 1M “programmable spiking neurons”. So the current chip has quadrupled the previous chip’s “programmable synapses” and multiplied the “configurable synapses” by a factor of 1,000.

Not sure why the configurable synapses went up so high, but it could be an aspect of connectivity, something akin to a “complete graph”, which has a direct edge between every pair of nodes in the graph. In a complete graph with N nodes, the number of edges is [N*(N-1)]/2, which for 1M nodes would be ~500B edges. So TrueNorth is nowhere near a complete graph; its 256M configurable synapses work out to roughly 256 connections per neuron.
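Running the numbers:

```python
# Edge count of a complete graph on N nodes: N*(N-1)/2
N = 1_000_000                  # TrueNorth's ~1M neurons
print(N * (N - 1) // 2)        # 499,999,500,000 -> ~500 billion edges

print(256_000_000 // N)        # 256M synapses / 1M neurons = 256 connections per neuron
```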

Analog vs. Digital?

When I last talked with IBM about their earlier version of the chip, I wondered why they used digital logic to create it rather than analog. They said that to better follow along the technology curve of normal chip electronics, digital was the way to go.

It seemed to me at the time that if you really wanted to simulate a brain’s neural processing, you would want to use an analog approach, and this should use much less power. I wrote a couple of posts on the subject, one of which was on MIT’s analog neuromorphic chip (see my MIT builds analog neuromorphic chip post) and the other on why analog made more sense than digital technology for neuromorphic computation (see my Analog neural simulation or Digital neuromorphic computing vs. AI post).

The funny thing is that IBM’s TrueNorth chip uses a lot less power (1000X less, milliwatts vs. watts) than the normal CMOS chips in use today. Not sure why this would be the case with digital logic, but if it’s true, maybe there’s more potential to utilize these sorts of chips in wider applications beyond just traditional AI domains.

How do you program it?

I would really like to get a deeper look at the specs for TrueNorth and its programming model. There was a conference last year where IBM presented three technical papers on TrueNorth architecture and programming capabilities (see the MIT Technology Review article: IBM scientists show blueprints for brain-like computing).

Apparently the 1M programmable spiking neurons are organized into blocks of 256 neurons each (with a prodigious number of “configurable” synapses as well). These seem equivalent to what I would call a computational unit. One programs these blocks with “corelets”, which map out the neural activity the 256-neuron blocks perform. These corelet “programs” can also be linked together, or one can be subsumed within another, sort of like subroutines. As of last year, IBM had a library of 150 corelets which do things like detect visual artifacts, detect motion in a visual image, detect color, etc.
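IBM’s actual corelet programming environment has its own language and tooling, so take the following only as a conceptual sketch of the composition idea described above: blocks of 256 neurons wrapped as named, reusable units that can be nested like subroutines. Every name and number here is invented for illustration.

```python
class Corelet:
    """Hypothetical stand-in for a corelet: a named unit that wraps some
    number of 256-neuron blocks and may subsume other corelets."""
    def __init__(self, name, blocks=1, children=None):
        self.name = name
        self.blocks = blocks              # 256-neuron blocks used directly
        self.children = children or []    # corelets subsumed within this one

    def total_neurons(self):
        return 256 * self.blocks + sum(c.total_neurons() for c in self.children)

# Compose a "motion detector" out of smaller corelets, subroutine-style.
edge = Corelet("edge-detect", blocks=4)
color = Corelet("color-detect", blocks=2)
motion = Corelet("motion-detect", blocks=8, children=[edge, color])

print(motion.total_neurons())   # (8 + 4 + 2) * 256 = 3,584 neurons
```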

Scale-out neuromorphic chips?

The abstract of the Science journal paper talks specifically about a communication network interface that allows TrueNorth chips to be “tiled in two dimensions” to some arbitrary size. So it is apparent that with the TrueNorth design, IBM has somehow extended the within-chip block interface that allows corelets to call one another to go off-chip as well. With this capability they have created a scale-out model for the TrueNorth chip.

Unclear why they felt it had to go only two-dimensional rather than three, but it seems to mimic the sort of cortex-layer connections we have in our brains today. And even with only two-dimensional scaling, there are all sorts of interesting topologies possible.

There doesn’t appear to be any theoretical limit to the number of chips that can be connected in this fashion, but I would suppose they would all need to be on a single board, or at least “close” together, because there’s some sort of propagation-delay budget that couldn’t be exceeded, i.e., the time it takes for a spike to traverse from one chip to the farthest chip in the chain couldn’t exceed, say, 10msec or so.

So how close are we to brain level computations?

In one of my previous posts I reported Wikipedia as stating that a typical brain has 86B neurons with between 100M and 500M synapses. I was able to find the 86B number reference today but couldn’t find the 100M to 500M synapse quote again. However, if these numbers are close to the truth, the ratio between neurons and synapses is much lower in a human brain than in the TrueNorth chip. Either way, TrueNorth would need about 86,000 chips connected together to match the neuron count of a human brain.

I suppose the excess synapses in the TrueNorth chip are due to the fact that electronic connections have to be fixed in place for a neuron-to-neuron connection to exist, whereas in the brain we can always grow synapse connections as needed. Also, I read somewhere (can’t remember where) that a human brain at birth has a lot more synapse connections than an adult brain, and that part of the learning process during early life is to trim the excess synapses down to something more manageable, or at least to what’s needed.

So to conclude, we (or at least IBM) seem to be making good strides toward a neuromorphic computational model and physical hardware, but we are still six or seven generations away from a human brain’s capabilities (assuming 1,000 of these chips could be connected together into one “brain”). If a neuromorphic chip generation takes ~2 years, then we should be getting pretty close to human levels of computation by 2028 or so.

The Tech Review article said that the 5B transistors on TrueNorth are more than on any other chip IBM has produced. So they seem to be at the limit of current technology capabilities with this chip design (which is probably proof that their selection of digital logic was a wise decision).

Let’s just hope it doesn’t take it 18 years of programming/education to attain college level understanding…

Comments?

Photo Credit(s): New 20x [view of mouse cortex] by Robert Cudmore

Vacuum tubes on silicon

Read an interesting article the other day about researchers at NASA having invented a vacuum tube on a chip (see ExtremeTech, Vacuum tube strikes back). Their report was based on an IEEE Spectrum article called Introducing the Vacuum Transistor.

Computers started out early in the last century as mechanical devices (card sorters), moved up to electronic sorters/calculators/computers with vacuum tubes, and eventually transitioned to solid state devices with the silicon transistor. Since then, the MOS and CMOS transistor have pretty much ruled the world of electronic devices.

Vacuum tube?

Vacuum tubes had a number of problems, not the least of which were power consumption, size and reliability. It was nothing for a vacuum tube to burn out every couple of times it was powered on, and the ENIAC (panel pictured here) had over 17,000 of them, took over 200 sq meters of space, used a lot of power (150KW) and weighed 27 (metric) tons.

Of course, each vacuum tube was the equivalent of just one transistor, and the latest generation Intel quad-core processors have over 2B transistors in them. So implementing an Intel quad-core processor with vacuum tubes might take over 3,000 football fields of space and over 17GW of power/cooling.
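Checking that claim against the ENIAC figures above, with one tube per transistor and assuming an American football field is roughly 5,300 square meters:

```python
# Scale the ENIAC figures (17,000+ tubes, ~200 m^2, ~150 KW) up to a
# 2B-transistor chip, one tube per transistor. The field size is an assumption.
eniac_tubes, eniac_area_m2, eniac_power_kw = 17_000, 200, 150
transistors = 2_000_000_000
field_m2 = 5_300

eniacs = transistors / eniac_tubes                 # ~118,000 ENIACs' worth of tubes
fields = eniacs * eniac_area_m2 / field_m2         # ~4,400 football fields
power_gw = eniacs * eniac_power_kw / 1_000_000     # ~18 GW

print(f"~{eniacs:,.0f} ENIACs, ~{fields:,.0f} fields, ~{power_gw:.0f} GW")
```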

There were plenty of niceties with vacuum tubes, not the least of which were their ruler-flat frequency response and their ability to support much higher frequencies; they were also significantly less prone to noise and had fewer problems with radiation than transistors. This last item meant that vacuum tubes were less susceptible to electromagnetic pulses. Many modern musical instrument amplifiers are still made today using vacuum tube technology due to their perceived better sound.

But the main problems were their size and power consumption. If you could only shrink a vacuum tube to the size of a MOS field effect transistor (FET) and correspondingly reduce its power consumption, then you would have something.

NASA shrinks the vacuum tube

NASA researchers have shrunk the vacuum tube to nanometer dimensions in a vacuum- channel transistor. They believe it can be fabricated on standard CMOS technology lines and that it can operate at 460GHz. 

This new vacuum-channel transistor marries the benefits of vacuum tubes to the fabrication advantages of MOSFET technology. Making them as small as MOSFET transistors eliminates all of the problems with vacuum tube technology and handily solves a serious problem or two with MOSFETs.


One problem with MOSFET technology today is that we can no longer speed it up much beyond 4-5GHz. This limit was reached around 2004, when Intel and others determined that clock speed couldn’t be pushed much higher without serious problems resulting, and as a result they started using the additional transistors to offer multi-core processor chips. A lot of time and money continues to be spent on how best to offer even more cores, but in the end there’s only so much parallelism that can be achieved in most applications, and this limits the speedups attainable with multi-core architectures.

But a shrunken vacuum tube doesn’t seem to have the same issues with higher clock speeds. Also, a serious reduction in power consumption accrues along with the reduction in size.

The vacuum in a vacuum tube was there to keep electrons from being interfered with by gases. With the vacuum-channel transistor, the researchers don’t think they need a vacuum anymore, due to the reduction in size and power; instead, they plan to use a helium-filled enclosure, which they feel will work in place of a vacuum. NASA feels that with today’s chip packaging this shouldn’t be a problem.

Also, their current prototypes use 10V, but other researchers have gotten vacuum-channel transistors down to only 1-2V. As of yet, the NASA researchers haven’t fabricated their vacuum-channel transistors on a real CMOS line, but that’s the next major hurdle.

Imagine a much faster IT

A 400GHz processor in your desktop, and maybe a 200GHz processor in your phone/tablet, could all be possible with vacuum-channel transistors. They would be so much faster than today’s multi-core systems that it would be almost impossible to compare the two. Yes, there are some apps where multi-core can speed things up considerably, but something that’s 10X faster than today’s processors would in most cases outrun a 10-core CPU. And it still doesn’t mean you couldn’t have multi-core vacuum-channel systems as well.

SSD or NAND flash storage is essentially based on CMOS transistors, and the speed of flash is somewhat a function of the speed of its transistors. A 400GHz vacuum-channel transistor could speed up flash storage by an order of magnitude or more. Flash access times are already at the 7µsec level (see my posts on MCS and UltraDIMM storage here and here). How much of that 7µsec access time is due to the memory channel and how much is a function of the SanDisk SSD storage is an open question. But whatever portion is on the SSD side could potentially be reduced by a factor of 10 or more with the use of vacuum-channel transistors.

From a disk perspective, there are myriad issues that affect how much data can be stored linearly on a disk platter. But one of them is the switching speed of the electromagnetic (GMR) head and its electronics. Vacuum-channel transistors should be able to eliminate that issue, at least in the electronics, and maybe with some work in the head as well, so disk densities would no longer have to worry about switching speeds. Similar issues apply to magnetic tape densities as well.

It’s unclear to me how faster switching would impact network transmission speeds. It seems apparent that optical transmission has already reached some sort of limit based on the light frequencies used for transmission. However, electronic network transfer speeds may be enhanced significantly by faster switching.

Naturally, WiFi and other forms of radio transmission are seriously limited by the current frequency and power of electronic switching. That’s one of the reasons why radio stations still depend somewhat on vacuum tubes. However, with vacuum-channel transistors, those switching-speed problems go away. Indeed, NASA researchers believe their vacuum-channel transistors should be able to reach terahertz (1000GHz) switching, which might make WiFi almost faster than any direct-connect networking today.

~~~~
Comments?

Photo Credit(s): ENIAC panel (rear) by Erik Pittit, The Vacuum Tube Transistor from IEEE Spectrum

Thinly provisioned compute clouds

Thin provisioning has been around in storage since StorageTek’s Iceberg hit the enterprise market in 1995.  However, thin provisioning has never taken off for system servers or virtual machines (VMs).

But recently a paper out of MIT, Making cloud computing more efficient, discussed some research that came up with the idea of monitoring system activity to model and predict application performance.

So how does this enable thinly provisioned VMs?

With a model like this in place, one could conceivably provide a thinly provisioned virtual server that could guarantee a QoS level while still minimizing resource consumption. For example, have the application VM consume just the resources needed at any instant in time, adjusting as demands on the system change. Thus, as an application’s needs grew, more resources could be supplied, and as its needs shrank, resources could be given up for other uses.

With this sort of server QoS, certain classes of application VMs would need to have variable or no QoS guarantees, so that they could be sacrificed in times of need to those that required guaranteed QoS. But in a cloud service environment, a multiplicity of service classes like these could be offered at different price points.

Thin provisioning grew up in storage because it’s relatively straightforward for a storage subsystem to understand capacity demands at any instant in time. A storage system only needs to monitor data write activity: if a data block has been written (consumed), it is backed by real storage; if it has never been written, it’s relatively easy to fabricate a block of zeros should it ever be read.
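A toy sketch of that write-tracking idea (not any particular array’s implementation, just the concept):

```python
BLOCK_SIZE = 4096

class ThinVolume:
    """Toy thin-provisioned volume: advertises a large logical size but only
    backs the blocks that have actually been written."""
    def __init__(self, logical_blocks):
        self.logical_blocks = logical_blocks
        self.backing = {}                          # block number -> data

    def write(self, block_no, data):
        assert 0 <= block_no < self.logical_blocks
        self.backing[block_no] = data              # real capacity consumed here

    def read(self, block_no):
        # Never-written blocks are fabricated as zeros, consuming no capacity.
        return self.backing.get(block_no, b"\x00" * BLOCK_SIZE)

    def allocated_bytes(self):
        return len(self.backing) * BLOCK_SIZE

vol = ThinVolume(logical_blocks=1_000_000)         # ~4GB logical volume
vol.write(42, b"x" * BLOCK_SIZE)
print(vol.allocated_bytes())                       # 4096 bytes actually backed
```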

Prior to thinly provisioned storage, fat provisioning required that storage be configured to the maximum capacity ever required of it. Similarly, fully (or fat) provisioned VMs must be configured for peak workloads. With the advent of thin provisioning in storage, wasted resources (capacity, in the case of storage) could be shared across multiple thinly provisioned volumes (LUNs), thereby freeing up these resources for other users.

Problems with server thin provisioning

I see some potential problems with the model and with my assumptions as to how thinly provisioned VMs would work. First, the modeled performance is a lagging indicator at best. Just as system transactions start to get slower, a hypervisor would need to interrupt the VM to add more physical (or virtual) resources. Naturally, during the interruption, system performance would suffer.

It would be helpful if resources could be added to a VM dynamically, in real time without impacting the applications running in the VM. But it seems to me that adding physical or virtual CPU cores,  memory, bandwidth, etc., to a VM would require at least some sort of interruption to a pair of VMs [the one giving up the resource(s) and the one gaining the freed up resource(s)].

Similar issues occur for thinly provisioned storage. As storage is consumed for a thinly provisioned volume, allocating more physical capacity takes some amount of storage subsystem resources and time to accomplish.

How does the model work?

It appears that the software model works by predicting system performance based on a limited set of measurements. Indeed, the model comes in two variants:

  • Black box model – tracks server or VM indicators such as the “number and type of user requests” as well as system performance, and uses AI to correlate the two. This works well for moderate fluctuations in demand but doesn’t help when requests for services fall beyond those boundaries. (A toy sketch of this idea follows the list.)
  • Grey box model – is more sophisticated and is based on an understanding of specific database functionality, such as how frequently databases flush host buffers, commit transactions to disk logs, etc. In this case, they are able to predict system performance even when demand peaks at 4X to 400X current system requirements.
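A minimal stand-in for the black box idea (not the MIT team’s code, and the numbers are invented): observe request rate alongside a resource measurement, fit a simple relationship, and use it to predict resource needs at a higher request rate.

```python
# Observed (requests/sec, CPU cores used) samples from a running VM. Invented data.
samples = [(100, 0.8), (200, 1.5), (400, 3.1), (800, 6.2)]

# Fit cores ~= a * rps with least squares through the origin.
a = sum(r * c for r, c in samples) / sum(r * r for r, _ in samples)

def predicted_cores(rps):
    """Predict how many cores the VM needs at a given request rate."""
    return a * rps

print(f"slope: ~{a:.4f} cores per req/s")
print(f"predicted cores at 1,200 req/s: {predicted_cores(1200):.1f}")
```

A real system would of course track many more indicators (memory, IO, network) with a far richer model, but the lagging-indicator worry raised above applies either way.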

They have implemented the grey box model for MySQL and are in the process of doing the same for PostgreSQL.

Model validation and availability

They tested their prediction algorithm against published TPC-C benchmark results and were able to come within 80% accuracy for CPU use and 99% accuracy for disk bandwidth consumption.

It appears that the team has released their code as open source. At least one database vendor, Teradata, is porting it over to their own database machine to better allocate physical resources to data warehouse queries.

It seems to me that this would be a natural fit for cloud compute providers and even more important for hypervisor solutions such as vSphere, Hyper-V, etc. Anyplace one could use more flexibility in assigning virtual or physical resources to an application or server would find use for this performance modeling.

~~~~

Now, if they could just do something to help create thinly provisioned highways, …

Image: Intel Team Inside Facebook Data Center By IntelFreePress