I’ve been writing about neuromorphic chips since 2011, 8 long years (see IBM SyNAPSE chip post from 2011 or search my site for “neuromorphic”) and none have been successfully reached the market. The problems with neurmorphic architectures have always been twofold, scaling AND software.
Scaling up neurons
The human brain has ~86B neurons (see wikipedia human brain article). So, 8 million neuromorphic neurons is great, but it’s about 10K X too few. And that doesn’t count the connections between neurons. Some human neurons have over 1000 connections between nerve cells (can’t seem to find this reference anymore?).
To get from a single chip with 125K neurons to their 8M neuron system, Intel took 64 chips and put them on a couple of boards. To scale that to 86B or so would take ~690, 000 of their neuromorphic chips. Now, no one can say if there’s not some level below 85B neuromorphic neurons, that could support a useful AI solution, but the scaling problem still exists.
Then there’s the synapse connections between neuromorphic neurons problem. The article says that Loihi chips are connected in a heirarchical routing network, which implies to me that there are switches and master switches (and maybe a really big master switch) in their 8M neuromorphic neuron system. Adding another 4 orders of magnitude more neuromorphic neurons to this may be impossible or at least may require another 4 sets of progressively larger switches to be added to their interconnect network. There’s a question of how many hops and the resultant latency in connecting two neuromorphic neurons together but that seems to be the least of the problem with neuromorphic architectures.
Missing software abstractions
The first time I heard about neuromorphic chips I asked what the software looks like and the only thing I heard was that it was complex and not very user friendly and they didn’t want to talk about it.
I keep asking about software for neuromorphic chips and still haven’t gotten a decent answer. So, what’s the problem. In today’s day and age, software is easy to do, relatively inexpensive to produce and can range from spaghetti code to a hierarchical masterpieces, so there’s plenty of room to innovate here.
But whenever I talk to engineers about what the software looks like, it almost seems like a software version of an early plug board unit-record computer (essentially card sorters). Only instead of wires, you have software neuromorphic network connections and instead of electro-magnetic devices, one has software spiking neuromorphic neuron hardware.
The way we left plugboards behind was by building up hardware abstractions such as adders, shifters, multipliers, etc. and moving away from punch cards as a storage medium. Somewhere along this transition, we created programing languages like (macro) Assemblers, COBOL, FORTRAN, LISP, etc. It’s the software languages that brought computing out of the labs and into the market.
It’s been at least 8 years now, and yet, no-one has built a spiking neuromorphic computer language yet. Why not?
I think the problem is there’s no level of abstraction above a neuron. Where’s the aritmetic logic unit (ALU) or register equivalents in neuromorphic computers? They don’t exist as far as I can see.
Until we can come up with some higher levels of abstraction, coding neuromorphic chips is going to be an engineering problem not a commercial endeavor.
But neuromorphism has advantages
The IEEE article states a couple of advantages for neuromorphic computing: less energy to perform inferencing (and possibly training) and the ability to train on incremental data rather than having to train across whole datasets again.
And the incremental training issue doesn’t seem any easier when you have ~80B neurons, with an occasional 1000s of connections between them to adjust correctly. From my perspective, its training advantage seems illusory at best.
Another advantage of neuromorphism is that it simulates the real analog logic of a human brain. Again, that’s great but a brain takes ~22 years to train (college level). Maybe because neuromorphic chips are electronic perhaps training can be done 100 times faster. But there’s still the software issue
I hate to be the bearer of bad news. There’s been some major R&D spend on neuromorphism and it continues today with no abatement.
I just think we’d all be better served figuring out how to program the beast than on –spending more to develop more chip hardware..
This is hard for me to say, as I have always been a proponent of hardware innovation. It’s just that neuromorphic software tools don’t exist yet. And I’m afraid, I don’t see any easy way forward to make any progress on this.
The problems with standard floating point have been known since they were first defined, in 1985 by the IEEE. As you may recall, an IEEE 754 floating point number has three parts a sign, an exponent and a mantissa (fraction or significand part). Both the exponent and mantissa can be negative.
Half precisionfloating point, (added in 2008), has 1 sign bit (for the significand or mantissa), 5 exponent bits (indicating 2**-62 to 2**+64) and 10 significand bits for a total of 16 bits.
Single precision floating point, has 1 sign bit, 8 exponent bits (indicating 2**-126 to 2**+128) and 23 significand bits for a total of32 bits.
Double precision floating point, has 1 sign bit, 11 exponent bits (2**-1022 to 2**+1024) and 52 significand bits.
Quadrouple precision floating point, has 1 sign bit, 15 exponent bits (2**-16,382 to 2**+16,384) and 112 significand bits.
I believe Half precision was introduced to help speed up AI deep learning training and inferencing.
Some problems with the IEEE standard include, it supports -0 and +0 which have different representations and -∞ and +∞ as well as can be used to represent a number of unique, Not-a-Numbers or NaNs which are illegal floating point numbers. So when performing IEEE standard floating point arithmetic, one needs to check to see if a result is a NaN which would make it an illegal result, and must be wary when comparing numbers such as -0, +0 and -∞ , +∞. because, sigh, they are not equal.
Posits to the rescue
It’s all a bit technical (read the paper to find out) but posits don’t support -0 and +0, just 0 and there’s no -∞ or +∞ in posits either, just ∞. Posits also allow for a variable number of exponent bits (which are encoded into Regime scale factor bits [whose value is determined by a useed factor] and Exponent scale factor bits) which means that the number of significand bits can also vary.
So, with a 32 bit, single precision Posit, the number range represented can be quite a bit larger than single precision floating point. Indeed, with the approach put forward by Gustafson, a single 32 bit posit has more numeric range than a single precision IEEE 754 float and about as 1/2 as much range as double precision IEEE floating point number but only uses 32 bits.
Presently, there’s no commercial hardware implementations of posits, but there’s a lot of interest. Mostly because, the same number of bits can represent a lot more numeric range than equivalently sized IEEE 754 floats. And for HPC environments, AI deep learning applications, scientific computing, etc. having more numeric range (or precision), in less space, means they can jam more data in the same storage, transfer more data over the same networking bandwidth and save more numbers in limited amounts of DRAM.
Although, commercial implementations do not exist, there’s been some FPGA simulations of posit floating point arithmetic. Those simulations have shown it to be more energy efficient than IEEE 754 floating point arithmetic for the same number of bits. So, you need to add better energy efficiency to the advantages of posit arithmetic.
Is it any wonder that HPC/big science (weather prediction, Square Kilometer Array, energy simulations, etc.) and many AI hardware accelerator chip designers are examining posits as a potential way to boost precision, reduce storage/memory footprint and reduce energy consumption.
Yet, standards have a way of persisting. Just look at how long the QWERTY keyboard has lasted. It was originally designed in the 1870’s to slow down typing and reduce jamming, when typewriters were mechanical devices. But ever since 1934, when the DVORAK keyboard was patented, there’s been much better layouts for keyboards. And there’s no arguing that the DVORAK keyboard is better for typing on non-mechanical typewriters. Yet today, I know of no computer vendor that ships DVORAK labeled keyboards. Once a standard becomes set, it’s very hard to dislodge.
Was at a conference last month where there was discussion of the “cloudless” future. This is so wrong, clouds are a threat to every IT hardware and software vendor out there and it’s not going away
The hardware side is easy to see.
Clouds threat to IT hardware vendors
On the storage side, the big hyperscalers have adopted software defined storage from the git go. Smaller ones are migrating that way as well and it’s even impacting data centers as the big virtualization software vendors release more and more functionality in SwDefStorage
And on the networking side, the clouds were an early adopter of Openflow, software defined networking. OpenFlow gear still requires specialized hardware but mostly it’s just a server with PCIe accelerator cards that perform high speed switching. Ditto the prior paragraph here as the virtualization vendors are also moving their networking to SwDefNetworking.
Luckily for servers there’s no such thing as a SwDefServer, yet. But server offerings are under just as big a threat from the cloud. Hyper-scalars are sophisticated enough to design their own server hardware and have it manufactured to spec. The smaller ones can make use of whitebox servers. Both of them, at the volumes they consume servers, can force a race to the bottom on pricing.
So server vendors are being relegated to the data center for the most part. And as data center servers become more powerful, virtualized environments need less of them.
The threat to IT software vendors
Make no mistake about it, software is under just as much threat as hardware. AWS and Oracle was probably the best example of how this works. Oracle was once a profitable niche market on AWS. Today, Oracle is not even available on AWS marketplace anymore.
This sort of dynamic can happen to any solution where acceptable open source alternatives exist. With the cloud’s sophistication and volumes they can easily take the sting out of using open source by providing ease of deployment, use and maintenance along with performance scalability. That makes running open source on clouds as easy as any packaged solution.
Albeit, maybe the cloud may not offer the support or hand-holding one obtains with packaged software. But open source can be very responsive to bugs/security exposures. Cloud providers can take the time to make their open source solutions bullet proof. And with 1000s to 10,000s of users, running them at scale, it should be easy enough to find any high profile bugs.
Even all those software vendors that make software that executes only on the cloud, to make it run better, more secure or to add some unique functionally are at risk. All these vendors ultimately will suffer by “death from marketplace success“. As they become successful and cloud vendors know inherently how successful they are, they become more interesting to the cloud. Over time more successful solutions will attract cloud provider functionally-equivalent, open source alternatives, that will push them out of the clouds marketplace.
Dealing with the threat to hardware vendors
Hardware vendors have four grand strategies to address the cloud threat.
Stick head in sand, hope it goes away (or at least takes a long time to kill them off). There are still some major vendors with this mindset. Yes, slowly but surely they are coming around to see the light but they think they have a long enough runway to hold on until something better comes along.
Co-opt the cloud by providing unique, hardware capabilities in their cloudenvironment. There are a few hardware vendors that have adopted this strategy. This buys them more time as they can depend on current data center revenues and over time augment this with cloud revenues.
Join the race to the bottom to become a supplier to clouds. Most hardware vendors started out in a highly competitive environment, but over time they have lost their way (found a higher profitability niche). But lurking in their past somewhere, there’s a competitiveness streak that’s dying to come out. The race to the bottom may never be as profitable as data centers but there’s significant revenue to be had here.
Co-opt the cloud by providing services that span multiple clouds. Not exactly creating a hybrid cloud but rather providing a multi-cloud hardware service. Hardware functionality that can be accessed from multiple clouds can enjoy some advantages of the cloud but at the same time generate data center like revenues..
I may be missing some grand hardware vendor strategies but as I’ve talked with hardware vendors over time these seem to be the main ones moving ahead.
I’ve tried a couple of times to talk to vendors in the #1 mindset above about the futility of their approach. Mostly, I get ignored or at best politely brushed off as being alarmist. Their main hope is that the data center continues on in the present environment and that they can retain their dominance there.
Maybe they have a point. The 1960smainframe environment still exists today. And IBM still remains dominant there, and generates profits there. But it just doesn’t matter that much to IT anymore. IT has moved on. .
Something similar will happen to IT’s data center. Yes it will still exist forever, and perhaps some vendors can continue to profit there.
But the vast majority of IT workloads will be moving to the cloud over time, relegating this to a smaller (proportionally) niche market. They’ve been saying tape is dead since 1967, but it’s still alive, it’s just moved from being a large market to a smaller one (proportionally).
The #2 mindset vendors have a clearer view of wha’s happening with the cloud. They are moving select hardware functionality out to the cloud as soon as they are able. Some are even placing their hardware in cloud provider availability zones (data centers) to support this. We all hope they enjoy lasting success doing this.
But ultimately they to, shall suffer the same fate as software vendors above, due to the cloud’s death by marketplace success. The more successful they become, the higher the likelihood that the cloud providers will go after them with their own functionally-equivalent, software defined solution.
I’m not privy to the contracts between hardware vendors and cloud providers bit perhaps this later transition, to outright competition, can be forestalled enough to make the cloud providers reluctant to compete with them. But hardware success can only lead to more cloud interest and no contract can protect against every contingency.
Those vendors adopting the #3 mindset have to return to their competitiveness roots. Doing this will never be as profitable as today’s data center. So the transition will be painful, but they need to do this soon, while they still have some profits coming from data center sales. The sooner they can deploy these $s to fix supply chains, manufacturing quality/production, drastically slim down marketing and sales, the faster they can start supplying the clouds with appropriate hardware. Profitability will suffer early on but it may never fully recover.
The #4 mindset applies equally well to software vendors as well as hardware vendors but the hardware group seems to be doing this already. Many storage vendors have multi-cloud solutions with hardware positioned in cloud-adjacent facilities that can be accessed from multiple clouds. Such services have to be consumable like any cloud service. But once in place they have a unique value proposition, the ability to move work and data from one cloud to another.
But the only thing stopping cloud providers doing something similar is that they don’t want to help any current user to use a different cloud. Again, depending on how successful this multi-cloud approach becomes, there’s nothing prohibiting the cloud providers from providing similar functionality.
Dealing with the threat to software vendors
Software vendors see 4 grand strategies to deal with the cloud threat:
If you can’t beat them, join them, and create their own cloud. IBM exemplifies this best but one can see this with Microsoft, Oracle, SAP and others. If they can create their own cloud, they can start to compete with cloud providers on an equal footing. Yes they will be smaller but they can enjoy many of the same benefits of bigger clouds, just not as much. .
Offer their software services/stack on the cloud providers. This is similar to the hardware vendors #2 mindset. Yet this has suffered from death by marketplace success since the inception.
Co-opt the cloud by providing services that fuse the data center and the cloud environments. Thus creating hybrid cloud solutions that span the data center-cloud environment which seem to have a longer runway. But this lasts only as long as the data center is a significant market.
Co-opt the cloud by providing services that span cloud provider vendors. Multi-cloud solutions are more apparent for hardware, but nothing prohibits a software vendor from offering services that spans clouds.
I may be missing a few grand strategies here but these seem to be the major ones software vendors are using to deal with the cloud. And just like hardware vendors above, much of the success of these strategies (at least #2,3 &4) depends on flying under the radar of cloud providers. Limiting your success may give you some time to eek out a decent revenue/profitability stream, while the cloud provider kills off the more successful solutions ahead of you. But you’re all living on borrowed time.
The most interesting one is #1. Yes economies of scale will matter, which will make their long term viability a concern. But at least you can be on the same playing field. Most of these companies have sizable treasure chests and if invest serious money on their own clouds, they may have a chance to survive.
Cloud providers are taking their time
The other thing that’s prolonging the data center and correspondingly vendors existence is cloud providers expenses. With all their hardware volumes, use of white box or own designed hardware and open source software, does it make any sense that IT could provide matching services in data centers by themselves.
But something is chewing up their revenues, Maybe it’s marketing, customer acquisition, software/hardware development or support expenses. I tend to think it’s trying to keep pace with customer growth. They end up having to anticipate this growth ahead of time and position hardware, software and services before the customers exist to use them.
I don’t think there’s anything more mysterious to their lack of profitability than that. They all want all the customers they can get. They are have significant growth and they are all charging a premium for their service. However, I may be wrong.
But how long can such hyper-growth last. At some point, as more and more IT organizations move to the cloud this growth will slow, prices will start to come down and it will set off a vicious cycle, more cloud success brings more volumes, less overhead and should lower prices which brings more cloud success.
More cloud success brings less volumes for hardware and software vendors, more overhead and ultimately higher prices.
None of the above solutions seem that attractive to hardware or software vendors but I see only a few ways forward for all of them.
In part 2, I’ll discuss some out of the box strategies that move beyond the data center and the cloud that may be just the way forward for hardware and software vendors need to take the cloud on.
I attended HPE Discover conference in Vegas this past week and among all their product announcements, there was a panel discussion on something called Swarm Intelligence. But it was really about collaborative learning.
Swarm Intelligence at HPE is a way for multiple organizations/edge devices to train a model collaboratively. They end up using their own data to train local models but then share their models (actually model node weights) with one another.
In this fashion, if say one hospital specializes in the detection and treatment of pneumonia and another in TB, they could both train a shared model on their respective sets of data. But during training, they share their model weights between them and after some number of training iterations, have a single model than supports detection of both.
How does swarm intelligence work?
To make swarm intelligence work:
All parties have to reach consensus on model hyper parameters, i.e, type of model (CNN, RNN, LSM, etc.), number of nodes per layer, number of layers, levels of connections between nodes, etc. So there’s a single model architecture to be trained across all the organizations.
All organization training data needs to be the same type, (e.g., X-rays).
After each model training session all model weights have to be shared with each other
All organizations have to decide on the method used to merge or combine the model weights (e.g. averaging). .
In the end, after N number of training epochs, their combined model would be essentially cross trained on each organizations data. But no one shared any data!
Why attempt swarm intelligence
HPE believes swarm intelligence would be a way to not have to transmit all that edge data to a central repository, but there’s other advantages:
A combined model could be trained on more data than any single organization could provide.
A combined model would have less organizational bias.
There’s one other possibility, but it’s unclear whether this is legally valid or not, but a combined model could be trained on data that it didn’t have legal access to.
One problem with the edge is the vast amount of data there
It turns out that a self driving car could generate 4TB of data/day of driving. Moving 4TB a day from all the cars in say a major metropolitan area (4 million people with ~1 million cars of which 20% are on the road each week day) could represent as much as 200K*4TB or 200EB of data/day.
There is not enough bandwidth in a fully 5G world to move that amount of data each day wirelessly and probably not enough bandwidth to move that amount of data over wire.
But if each car were to train its own (self-driving) model each day on its own data and then share that training model of say 1024 nodes with 1024 layers, it would represent 1M node weights ( floating point numbers), or ~16MB of data, if done effectively, one could have a cities worth of training data to train your self driving car models.
The allure of swarm intelligence/collaborative learning is high. It seems a small cost to reach consensus on the model hyper-parameters, collaborative learning methodology and synchronized training epochs to create a model trained on multiple organization/edge device training data.
HPE discussed using private blockchains to coordinate the sharing of model training across organizations or edge devices and use the block chain to compensate organizations for the use of their trained models. Certainly this could work well with edge devices but it seems an unnecessary complication for collaborative organizations.
Nonetheless, swarm intelligence may just be one way to address some of serious problems with deep learning today.
Read three articles this past week or so that shed some light on Photonic computing, that is doing computing with light rather than electronic signals. We discuss the first two directly below and save the last for the end.
Luminous Computings, CTO Mitchell Nahmias has helped author a number of papers, articles, & books on photonics with others from Princeton University, perhaps the most impressive being Principals of neuromorphic photonics. If they are following this approach, it would seem likely they are creating a new photonic neuromorphic chip.
There is absolutely minimal information on their web site so, I am assuming they’re targeting a neuromorphic chip from the CTO’s history of research publications. I could be completely incorrect on this assumption but will continue with it for now.
Neuromorphic AI is in contrast to standard neural network deep learning approaches to AI. We have discussed Neuromorphic computing in a number of posts (e.g. see IBM introduces the SYNAPSE chip from 2011 or University of Manchester fires up the largest neuromorphic chip from 2018, more if interested, just search for “Neuromorphic”). Neuromorphic approaches to AI although had early promise, have been losing the technological arms race, as standard AI DL using GPUs, implementing neural networks, seems to have vastly more adherents both in academia as well as industry.
On the other hand, simulating actual neurons, as in neuromorphic computing, has the potential to be a significant breakthrough. But only if it can be orders of magnitude faster, cheaper or somehow better than standard AI DL neural network processing.
In one of the news reports, it mentioned that Luminous Computing has working silicon. Assuming their approach has significant AI training advantages, will it also require a neuromorphic chip to perform AI inferencing?
Photonics from MIT
The MIT group states, that based on their simulation of the photonics chip, they can perform AI DL using neural network training with much less energy. In their paper, they mention that their matrix multiplication is performed passively by optical interference alone. In this case, both the input and the multiply and accumulate function (MAC) are done using photonics.
AI DL neural network training takes multiple iterations of matrix multiplications and accumulate (MAC) functions. A standard CPU takes about 20 pj/MAC (10**-12 pico-Joules/MAC) and a GPU takes about 1 pj/MAC. Whereas in simulation they were able to show their photonic chip should be able to perform 10 fJ/MAC (10**-15 femto-Joules/MAC) This would put the MIT photonic chip designpower consumption per MAC ~1000X better than CPUs and ~100X better than GPUs.
The paper goes on to say that theoretically, they could perform at 50 to 100 zJ/MAC (10**-21 zepto-Joules/MAC), which would add another factor of 1000 to their power advantages.
In addition, MIT’s photonic chip design is being constructed using “free [space] optical components” and should be able to scale much better than “nano photonic implementations”. Nano photonics requires waveguides and optical couplers whereas free-space photonics uses lasers beams traveling through space.
Their paper claims that nano-photonics can only support ~1000 node neural networks but with their free-space photonic solution, they believe they can support a 1,000,000 node neural network. Not sure how this would work in a chip design with lasers visible, being routed via micro-mirrors and other optical components, around a chip.
Nothing I could find indicates that the MIT team has working silicon nor have spun out a startup with seed funding.
One can argue with the numbers. Also, the fact that the car has historically consumed gas, a non-renewable energy source and the GPU could theoretically use solar, wind or hydro renewable power for its energy needs also needs to be stated. Moreover, one trains a AI DL (NLP) model once with all the data and then may subsequently train it on new data that comes in, but it gets used a gazillion times. Most adults (at least in the US) have a car and drive it often, multiplying automobile energy impacts by 100s of millions in the US alone.
But the fact is that AI DL training, that iterates multiple times over 1000s to 1,000,000s of data items, takes energy and lots of it. How that energy is produced matter to global warming. And anything that has the potential to reduce this energy consumption should be welcomed.
Photonics may be ready for prime time but only if they can significantly improve AI training .
Read a couple of articles the past few weeks that highlighted something that not many of us are aware of, most of the data used to train AI deep learning (DL) models comes from us.
That is through our ignorance or tacit acceptation of licenses for apps that we use every day and for just walking around/interacting with the world.
The article in Atlantic, The AI supply chain runs on ignorance, talks about Ever, a picture sharing app (like Flickr), where users opted in to its facial recognition software to tag people in pictures. Ever also used that (tagged by machine or person) data to train its facial recognition software which it sells to government agencies throughout the world.
The article went on to say that gathering photos from people in public places is not against the law. The study was also cleared by the school. The database was not released until after the students graduated but it did have information about the time and date the photos were taken.
But that’s nothing…
The same thing applies to video sharing and photo animation models, podcasting and text speaking models, blogging and written word generation models, etc. All this data is just lying around the web, freely available for any AI DL data engineer to grab and use to train their models. The article which included the image below talks about a new dataset of millions of webpages.
,Google photo search is scanning the web and has access to any photo posted to use for training data. Facebook, IG, and others have millions of photos that people are posting online every day, many of which are tagged, with information identifying people in the photos. I’m sure some where there’s a clause in a license agreement that says your photos, when posted on our app, no longer belong to you alone.
As security cameras become more pervasive, camera data will readily be used to train even more advanced facial recognition models without your say so, approval or even appreciation that it is happening. And this is in the first world, with data privacy and identity security protections paramount, imagine how the rest of the world’s data will be used.
With AI DL models, it’s all about the data. Yes much of it is messy and has to be cleaned up, massaged and sometimes annotated to be useful for DL training. But the origins of that training data are typically not disclosed to the AI data engineers nor the people that created it.
We all thought China would have a lead in AI DL because of their unfettered access to data, but the west has its own way to gain unconstrained access to vast amounts of data. And we are living through it today.
Yes AI DL models have the potential to drastically help the world, humanity and government do good things better. But a dark side to AI DL models also exist to help bad actors, organizations and even some government agencies do evil.
Read an article the other day about a new collaboration data platform, the ZooArchNet, for archeological and zoological data ( data about animals and the history of humankind).
The collaboration was started at the Florida Museum at the University of Florida. They intend to construct a database that would allow researchers to track the history of animals and how humans have interacted with them over time.
The problem is there’s a lot of historical animal specimen information available in various locations/sites around the world and similarly, there’s a lot of data about the history of humanity, but there’s little that cross links the two. And by missing those cross links, researchers aren’t seeing the big picture, that humankind and animal-kind have co-existed since the dawn of time and have impacted each other throughout their history.
However, if there was a site where one could trace the history of animal and human life, across time, in a region, one could develop a better understanding of how they interact and impact one another.
Humankind interacting with animalkind
In the article, they discuss a number of examples where animals have been impacted by humankind over time. For example, originally the Mexican Turkey was domesticated for its feathers during Mayan, Aztec and other civilizations of Central America,. but over time it became a prized for use as food. While this was occurring, its range expanded considerably throughout North (and South) America.
It’s the understanding of habitat range over time and how humankind helped or hindered this range that’s best served by linking zoological and archeological data sets that exist in research libraries throughout the world.
How it works
One problem in cross linking such data is that it often exists in different formats and uses different metadata to describe it.
A key, early decision was to use a standard metadata format ,the Darwin Core (DwC) an outgrowth of the Dublin Core which is more focused on zoological data.
With this in place, the next problem was to translate specimen metadata into the DwC and extract the actual data (or URI’s) that described the specimen for harvesting. Once all that was accomplished they could migrate the specimen data or archeological data and host it/cross link it in their ZooArchNet database.
Once the data was available and located in Google Cloud storage, researchers could use Google BigQuery data analytics as well as other apps like (Google) indexers to create more data rich views and searchable indices for their ZooArchNet and VertNet web portals.
ZooArchNet is just starting. Most of the information currently available is about the few examples chosen to demonstrate the technology. As with anything like this, there’s a certain amount of crowd sourcing needed to make it worthwhile. It’s popularity will be a prime determinant on its usefulness over time. But anything that helps the world understand the true history of humanity’s impact on this life of this planet is worthwhile.
As you already know, Optane SSDs have been on the market now for at least a year or so and have not gained much market traction as of yet. I and others attribute this to the high price differential between Optane SSDs and NVMe Flash SSDs but others may say it’s a matter of production yields – probably a little of both.
But Optane, as announced, always had another form factor (if that’s the right term), as persistent memory that could dramatically increase the size of server memory to support new memory intensive applications at a lower price than DRAM.
I was at Nutanix .NEXT conference last week and saw a 4 socket server from DELL that had 6TB of DRAM in it (and 4-44 core CPUs). I didn’t ask the price but when I mentioned I wanted one for my home office, they said it could easily heat my house. So the other problem with a lot of DRAM is power consumption.
Optane DC PM (data center persistent) memory is intended to solve both the high cost and high power consumption problems of DRAM.
How does it work in a server
The newer Intel motherboards support up to 12 slots of memory per socket. And up to 6 of these can be Optane DC PM (512GB DIMM) or 3TB per socket. Optane DC PM is accessed via L1-L2 caching just like any other memory. Apparently with a dual socket system you can have up to 11 Optane DC PM DIMMs on the motherboard.
L1-L2 cache access times are on the order of picoseconds (10**[-12] seconds), DRAM is on the order of nanoseconds (10**[-9] seconds) and flash is on the order of 100 microseconds (100*10**[-6] seconds). So there’s a vast access time gulf between DRAM and Flash that could be exploited with the right technology – enter Optane DC PM.
The only detailed info I could find on Optane DC PM access times was in a research paper (see Basic performance of Intel Optane DC PMM research paper) and it said Optane DC PM assessing times are ~350 nanoseconds, or close to right between DRAM and Flash. At the show the development team indicated that Optane DC PM support about 3GB/sec of bandwidth per module (DIMM).
There are two ways to use Optane DC PM:
Memory mode – in Memory mode, the data in Optane DC PM is thrown away during a power cycle. You must use a block of DRAM as a cache or rather a virtual memory block to the Optane DC PM acting as a paging store. Data is brought into the DRAM cache when accessed using its (virtual) DRAM address and when no longer used. it gets evicted (destaged) back out to Optane DC PM. When power is cycled the data in Optane DC PM is cleared out. Optane DC PM supports AES XTS-256 bit encryption and can easily be cleared by throwing away encryption keys during a power cycle.
App Direct mode – in App Direct mode, Optane DC PM is accessed directly using application APIs and its data persists across power cycles. For App Direct mode, Optane DC PM is still AES 256 encrypted but here the encryption key is maintained across power cycles but is locked on power up and you need to use a pass phrase to unlock it. In this mode, applications are responsible for flushing (L1-L2) caches to Optane to retains all data written through L1-L2 to the Optane DC PM. There’s a GitHub Persistent Memory Development Kit (PMDK) libraryfor that supports the App Direct mode API that applications need to use.
Both modes use DDR-T, (transactional DDR4) a new memory bus protocol for Optane DC PM access. In the DDR-T protocol, access to the memory bus is requested by a CPU and is granted by an Optane DC PM DIMM. All Optane DC PM DIMMs on a system can be accessed in parrallel.
You can use RDMA to replicate (App direct?) Optane DC PM data from one system to another. In order to support Memory and App Direct mode, Optane DC PM required CPU, BIOS and (application) software changes.
Most of the Optane DC PM support and cryptology logic is implemented in hardware. Optane DC PM has an address indirection table (AIT) to support 3D XPoint wear leveling maintained in DRAM but flushed to Optane during power loss. Transfers to 3D XPoint media is in 256 byte cache lines but the memory bus operates in 64 byte cache lines, so there’s a (DRAM) buffer between media and memory bus.
Optane also supports a high availability, or up to two device failure mode. In this scenario, if one Optane DC PM DIMM fails, the system can swap another spare Optane DC PM DIMM into that address space and continue to operate. If a 2nd Optane DC PM fails then the system fails. Not sure what happens to the data on the original Optane DC PM DIMM during a failure. It seems to me data would be lost, but it could depend on its failure mode.
In Memory mode, the expected ratio between DRAM size and Optane DC PM size is should be 32GB DRAM/6TB Optane DC PM. At the TFDx event, the Optane DC PM team had some performance charts showing different DRAM cache miss rates. Intel also announced new CPU monitoring statistics to track application/workloads impacting DRAM/Optane DC PM in Memory mode and to track Optane DC PM health.
Optane DC PM usage modes are established by the BIOS. It’s flexible enough to have the Optane DC PM usage modes be defined on a region by region basis. Not exactly sure what a region is, but it could be an address range spanning MB(?) of Optane DC PM. With both modes in operation on a system, data can be moved from Memory mode Optane to App direct mode Optane or vice versa.
Intel expects that lifetime of an Optane DC PM DIMM to be from 200-370PB of data writes. Optane DC PMs have a 5 year warrantee. Given its bandwidth (3GB/sec), 200PB of data writes should last ~2 years but that’s at 100% duty cycle, writing 3GB of data, every second of every day. So, 5 years should be a reasonable guarantee using a more realistic ~40% duty cycle.
What applications use Optane DC PM
The one of interest to most people seems to be SAP HANA. According to the development team, SAP HANA could use App Direct mode for main database storage and use DRAM for its delta column store. Cassandra could also use Optane in App Direct mode in a similar fashion.
Also something like a REDIS for key-value store could use Optane DC PM to store Values and use DRAM to store Keys.
But any application could take advantage of the extra memory made available with Optane DC PM DIMMs in Memory mode today. Of course any use of Optane DC PM would require the right levels of Intel Xeon CPUs (Cascade Lake), BIOSes and motherboards.
Interested in learning more, TFDx videos of the event are available on the website noted previously. Also these TFDx bloggers also have posts specifically on Optane DC PM.