Data and code versioning for MLOps

I read an interesting article (Ex-Apple engineers raise … data storage startup) and a research paper (Git is for data) about a group of ML engineers from Apple forming a new “data storage” startup targeted at MLOps teams just like Apple’s. It turns out that MLOps has some very unique data requirements that go way beyond just data storage.

The paper discusses some of the unusual data requirements for MLOps such as:

  • Infrequent updates – yes, there are some MLOps datasets where updates are streamed in, but the vast majority of MLOps datasets are updated on a slower cadence. The authors think monthly works for most MLOps teams.
  • Small changes/lots of copies – The changes to MLOps data are relatively small compared to the overall dataset size and usually consist of data additions, record deletions, label updates, etc. But unlike most data, MLOps data are often subsetted or extracted into smaller datasets used for testing, experimentation and other “off-label” activities.
  • Variety of file types – depending on the application domain, MLOps file types range all over the place. But there’s often a lot of CSV files in combination with text, images, audio, and semi-structured data (DICOM, FASTQ, sensor streams, etc.). However within a single domain, MLOps file types are pretty much all the same.
  • Variety of file directory trees – this is very MLOps team and model dependent. Usually there are train/validate/test splits to every MLOps dataset but what’s underneath each of these can vary a lot and needs to be user customizable.
  • Data often requires pre-processing to be cleansed and made into something appropriate and more useable by ML models.
  • Code and data must co-evolve over time – as data changes, the code that uses it changes. Adding more data may not cause changes to code, but models are constantly under scrutiny to improve performance and accuracy or to remove biases. Bias elimination often requires data changes, but code changes may also be needed.

It’s that last requirement, that MLOps data and code must co-evolve and thus need to be versioned together, that’s most unusual. Data-code co-evolution is needed for reproducibility, rollback and QA, but for many other reasons as well.

In the paper they show a typical MLOps data pipeline.

Versioning can also provide data (and code) provenance, identifying the origin of data (and code). MLOps teams undergoing continuous integration need to know where data and code came from and who changed them. And as most MLOps teams collaborate in the development, they also need a way to identify data and code conflicts when multiple changes occur to the same artifact.

Source version control

Code has had this versioning problem forever and the solution became revision control systems (RCS) or source version control (SVC) systems. The most popular solutions for code RCS are Git (software) and GitHub (SaaS). Both provide repositories and source code version control (clone, checkout, diff, add/merge, commit, etc.) as well as a number of other features that enable teams of developers to collaborate on code development.

The only thing holding Git/GitHub back from being the answer to MLOps data and code version control is that they don’t handle large (>1MB) files very well.

The solution seems to be adding better data handling capabilities to Git or GitHub. And that’s what XetHub has created for Git.

XetHub’s “Git is for Data” paper (see link above) explains in much detail how they provide a better data layer for Git, but it boils down to using Git for code versioning and as a metadata database for their deduplicating data store. They use Merkle trees to track the chunks of data in a deduped dataset.

How XetHub works

XetHub supports variable-size chunking (for dedupe) in their data store. This allows them to use relatively small files checked into Git as the metadata that points to the current (and all previous) versions of data files checked into the system.

Their mean chunk size is ~4KB. Data chunks are stored in their data store. But the manifest for dataset versions is effectively stored in the Git repository.
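To make variable chunking a bit more concrete, here is a minimal sketch of content-defined chunking using a rolling "gear" hash. This is a generic technique and not necessarily what XetHub implements; `MASK_BITS`, `MIN_SIZE`, `GEAR` and `split_chunks` are names and values I made up for illustration. The 12-bit mask is chosen only because it gives a mean chunk size of roughly 4KB.

```python
import random

MASK_BITS = 12  # a cut is declared with probability 1/2**12 per byte (~4 KB mean)
MIN_SIZE = 256  # hypothetical lower bound so chunks never get tiny

# Table of random 64-bit values, one per byte value (the "gear" table).
# The seed is fixed so the same bytes always produce the same chunks.
_rng = random.Random(0)
GEAR = [_rng.getrandbits(64) for _ in range(256)]


def split_chunks(data: bytes) -> list[bytes]:
    """Split data at content-defined boundaries using a rolling gear hash."""
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        # Older bytes are shifted out of the 64-bit hash, so after 64 bytes
        # the value depends only on the most recent 64 bytes of content.
        h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFFFFFFFFFF
        if i - start + 1 >= MIN_SIZE and h >> (64 - MASK_BITS) == 0:
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])  # trailing partial chunk
    return chunks


# Demo: prepend a few bytes to a file and see how many chunks are unchanged.
demo_rng = random.Random(1)
original = bytes(demo_rng.getrandbits(8) for _ in range(200_000))
edited = b"new header" + original

old_chunks = split_chunks(original)
new_chunks = split_chunks(edited)
shared = set(old_chunks) & set(new_chunks)
print(len(old_chunks), "chunks,", len(shared), "reused after the edit")
```

Because boundaries depend on content rather than on byte offsets, an insertion near the start of a file only disturbs the chunks around the edit. With fixed-size blocks, every block after the insertion would shift and stop deduplicating.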

The paper shows how using a deduplicated data store can support data versioning.

XetHub uses a content addressable store (CAS) to store the file data chunk(s) as objects or BLOBs. The key to getting good IO performance out of such a system is to have small chunks but large objects.
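As a rough illustration of how a content addressable store supports versioning, here is a toy in-memory sketch. `ChunkStore`, `store_file` and `load_file` are hypothetical names, and the chunks are hand-picked rather than produced by a real chunker. The sketch also keeps one object per chunk, whereas a real system would pack many small chunks into large objects for IO performance.

```python
import hashlib


class ChunkStore:
    """Toy content addressable store: the key is the SHA-256 of the chunk."""

    def __init__(self) -> None:
        self.objects: dict[str, bytes] = {}

    def put(self, chunk: bytes) -> str:
        key = hashlib.sha256(chunk).hexdigest()
        self.objects.setdefault(key, chunk)  # identical chunks are stored once
        return key

    def get(self, key: str) -> bytes:
        return self.objects[key]


def store_file(store: ChunkStore, chunks: list[bytes]) -> list[str]:
    """Store a file version; the returned manifest is just a list of chunk keys."""
    return [store.put(c) for c in chunks]


def load_file(store: ChunkStore, manifest: list[str]) -> bytes:
    """Rebuild a file version from its manifest."""
    return b"".join(store.get(k) for k in manifest)


store = ChunkStore()
v1_chunks = [b"id,label\n", b"1,cat\n", b"2,dog\n"]
v2_chunks = v1_chunks + [b"3,bird\n"]  # version 2 only appends one record

manifest_v1 = store_file(store, v1_chunks)
manifest_v2 = store_file(store, v2_chunks)

print(len(store.objects), "unique chunks stored for two versions")
```

The manifests are the small pieces of metadata that would live in the Git repository, while the chunk bytes live in the data store. Adding version 2 costs one new chunk, not a second copy of the file.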

They map data chunks to files using CDMTs (content defined Merkle trees). Each chunk of data resides in at least two different CDMTs, one associated with the file version and the other associated with the data storage elements.
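For readers who haven’t run into Merkle trees before, here is a minimal sketch of a plain binary Merkle tree over chunk hashes. It is not the content defined variant (CDMT) the paper describes, and `sha256` and `merkle_root` are my own helper names. It only shows the basic property that a single root hash summarizes a whole list of chunks.

```python
import hashlib


def sha256(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def merkle_root(chunk_hashes: list[bytes]) -> bytes:
    """Hash pairs of nodes level by level until a single root remains."""
    if not chunk_hashes:
        return sha256(b"")
    level = list(chunk_hashes)
    while len(level) > 1:
        next_level = []
        for i in range(0, len(level), 2):
            # A leftover odd node is hashed on its own.
            next_level.append(sha256(b"".join(level[i:i + 2])))
        level = next_level
    return level[0]


chunks_v1 = [b"chunk-a", b"chunk-b", b"chunk-c"]
chunks_v2 = [b"chunk-a", b"chunk-B", b"chunk-c"]  # one chunk changed

root_v1 = merkle_root([sha256(c) for c in chunks_v1])
root_v2 = merkle_root([sha256(c) for c in chunks_v2])
print(root_v1.hex()[:16], root_v2.hex()[:16])
```

Changing any chunk changes the root, so comparing two roots is enough to tell whether two file versions are identical, and walking down the tree finds which chunks differ.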

XetHub’s variable chunking is done using a statistical approach and multiple checksums, but they also offer one specialized, file-type-specific chunking method for CSV files. Even with their general purpose variable chunking method, they can offer a ~9X dedupe ratio for text data (embeddings).

They end up using Git commands for code and data but provide hooks (Git filters) to support data cloning, add/checkin, commits, etc. So they can take advantage of all the capabilities Git has grown over the years to support collaborative code development, and use these for data as well as code.

In addition to normal Git services for code and data, XetHub also offers a read-only, NFSv3 file system interface to XetHub datasets. Doing this eliminates having to reconstitute and copy TBs of data from their code-data repo to user workstations. With NFSv3 front end access to XetHub data, users can easily incorporate data access for experimentation, testing and other uses.

Results from using XetHub

XetHub showed some benchmarks comparing their solution to Git LFS, another Git-based large data storage solution. For their benchmark, they used CORD-19 (see the arXiv paper and the Kaggle CORD-19 dataset), which is a corpus of all COVID-19 papers since COVID started. The corpus is updated daily and released periodically, and they used the last 50 versions (up to June 2022) of the research corpus for their benchmark.

Each version of the CORD-19 corpus consists of JSON files (research reports, up to 700K each) and 2 large CSV files, one with paper information and the other with paper (word?) embeddings (a more useable version of the paper text/tables used for ML modeling).

For CORD-19, XetHub is able to store all 2.45TB of research reports and CSV files in only 287GB of Git (metadata) and datastore data, for a dedupe factor of 8.7X. With XetHub’s specialized CSV chunking (“Xet w/ CSV chunking” in the paper’s results), the 50 CORD-19 versions can be stored in 87GB, for a 28.8X dedupe ratio. And of that 87GB, only 82GB is data and the remaining ~5GB is metadata (of which 1.7GB is the Merkle tree).
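As a quick sanity check on those ratios, the arithmetic can be redone in a few lines. The one assumption here is mine: the quoted figures line up if 1TB is counted as 1024GB.

```python
GB_PER_TB = 1024  # assumption: binary units, which reproduces the quoted ratios

raw_gb = 2.45 * GB_PER_TB       # all 50 versions, undeduplicated
general_gb = 287                # general purpose chunking
csv_gb = 87                     # with CSV-aware chunking
csv_data_gb = 82                # data portion of the 87GB

ratio_general = raw_gb / general_gb
ratio_csv = raw_gb / csv_gb
metadata_gb = csv_gb - csv_data_gb

print(f"general chunking: {ratio_general:.1f}X")
print(f"CSV chunking:     {ratio_csv:.1f}X")
print(f"metadata:         ~{metadata_gb}GB")
```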

In the paper, they also showed the cost of branching this data by extracting and adding one version which consisted of a 75-25% (random) split of a version. This split was accomplished by changing only the two CSV files (paper metadata and paper word embeddings). Adding this single split version to their code-data repository/datastore only took an additional 11GB of space. An aligned split (only partitioning on a CSV record boundary, unclear but presumably with CSV chunking) only added 185KB.

XetHub potential enhancements

XetHub envisions many enhancements to their solution, including adding other file-type-specific chunking strategies, adding a “time series” view to their NFS frontend to view code/data versions over time, finer granularity data provenance (at the record level rather than at the change level), and RW NFS access to data. Further, XetHub’s dedupe metadata (in the Git repo) only grows over time. Supporting updates and deletes to dedupe metadata would help reduce storage requirements.

Read the paper to find out more.


Fast (hard) or slow (soft) AGI takeoff – AGI Part 6

I was listening to a podcast a couple of weeks back and the person being interviewed commented that he didn’t believe that AGI would have a fast (hard) takeoff; rather, it would be slow (soft). (Here’s the podcast: John Carmack interviewed by Lex Fridman.)

Hard vs. soft takeoff

A hard (fast) takeoff implies a relatively quick transition (seconds, hours, days, or months) between AGI levels of intelligence and super AGI levels of intelligence. A soft (slow) takeoff implies it would take a long time (years, decades, centuries) to go from AGI to super AGI.

We’ve been talking about AGI for a while now and if you want to see more about our thoughts on the topic, check out our AGI posts (in most recent order: AGI part 5, part 4, part 3 (ish), part (2), part (1), and part (0)).

The real problem is that many believe that any AGI that reaches super-intelligence will have drastic consequences for the earth and especially for humanity. However, this is a whole other debate.

The view is that a slow AGI takeoff might (?) allow sufficient time to imbue any and all (super) AGI with enough safeguards to eliminate or minimize any existential threat to humanity and life on earth (see part (1) linked above).

A fast takeoff won’t give humanity enough time to head off this problem and will likely result in a humanity-ending and possibly earth-destroying event.

Hard vs Soft takeoff – the debate

I had always considered AGI would have a hard take off but Carmack seemed to think otherwise. His main reason is that current large transformer models (closest thing to AGI we have at the moment) are massive and take lots of special purpose (GPU/TPU/IPU) compute, lots of other compute and gobs and gobs of data to train on. Unclear what the requirements are to perform inferencing but suffice it to say it should be less.

And once AGI levels of intelligence were achieved, it would take a long time to acquire any additional regular or special purpose hardware, in secret, required to reach super AGI.

So, just to be MECE (mutually exclusive and collectively exhaustive) on the topic, the reasons researchers and others have posited to show that AGI will have a soft takeoff include:

  • AI hardware for training and inferencing AGI is specialized, costly, and acquisition of more will be hard to keep secret and as such, will take a long time to accomplish;
  • AI software algorithmic complexity needed to build better AGI systems is significantly hard (it’s taken 70yrs for humanity to reach today’s much-less-than-AGI intelligent systems) and will become exponentially harder to go beyond AGI level systems. This additional complexity will delay any takeoff;
  • Data availability to train AGI is humongous, hard to gather, find, & annotate properly. Finding good annotated data to go beyond AGI will be hard and will take a long time to obtain;
  • Human government and bureaucracy will slow it down and/or restrict any significant progress made in super AGI;
  • Human evolution took millions of years to go from chimp levels of intelligence to human levels of intelligence, so why would electronic evolution be 6-9 orders of magnitude faster?
  • AGI technology is taking off but the levels of intelligence are relatively minor and specialized today. One could say that modern AI has really been going since the 1990s, so we are 30yrs in and today have almost-good AI chatbots and AI agents that can summarize passages/articles, generate text from prompts or create art works from text. If it takes another 30yrs to get to AGI, that should provide sufficient time to build in capabilities to limit a super-AGI hard takeoff.

I suppose it’s best to take these one at a time.

  • Hardware acquisition difficulty – I suppose the easiest way for an intelligent agent to acquire additional hardware would be to crack cloud security and just take it. Other ways may be to obtain stolen credit card information and use these to (il)legally purchase more compute. Another approach is to optimize the current AGI algorithms to run better within the same AGI HW envelope, creating super AGI that doesn’t need any more hardware at all.
  • Software complexity growing – There’s no doubt that AGI software will be complex (although the podcast linked to above is sub-titled “AGI software will be simple”). But any sub-AGI agent that can change its code to become better or closer to AGI should be able to figure out how not to stop at AGI levels of intelligence and just continue optimizing until it reaches some wall.
  • Data acquisition/annotation will be hard – I tend to think the internet is the answer to any data limitations that might face an AGI agent. Plus, I’ve always questioned whether Wikipedia and some select other databases wouldn’t be all an AGI would need to train on to attain super AGI. Current transformer models are trained on Wikipedia dumps and other data scraped from the internet. So there are really two answers to this question: once internet access is available, it’s unclear that there would be a need for any more data; and, with the data available to current transformers, it’s unclear that this isn’t already more than enough to reach super AGI.
  • Human bureaucracy will prohibit it – Sadly this is the easiest to defeat. 1) There are rogue governments and actors around the world with more than sufficient resources to do this on their own. And no agency, UN or otherwise, will be able to stop them. 2) Unlike nuclear, the technology to do AI (AGI) is widely available to businesses and governments, all AI research is widely published (mostly open access nowadays) and, if anything, colleges/universities around the world are teaching the next round of AI scientists to take this on. 3) The benefits of being first are significant and are driving a weapons (AGI) race between organizations, companies, and countries to be first to get there.
  • Human evolution took millions of years, why would electronic be 6-9 orders of magnitude faster – electronic computation takes microseconds to nanoseconds to perform operations and humans probably 0.1 sec or so. Electronics is already 5 to 8 orders of magnitude faster than humans today. Yes, the human brain is more than one CPU core (each neuron would be considered a computational element). But there are 64-core CPUs and 4096-core GPUs out there today, and one could probably consider them similar in nature if taken in the aggregate (across a hyperscaler, let’s say). So, just using the speedups above, it should take anywhere from 1/1000 of a year to 1 year to cover the same computational evolution as human evolution covered between the chimp and human, and accordingly between AGI and AGIx2 (ish).
  • AGI technology is taking a long time to reach, which should provide sufficient time to build in safeguards – Similar to the discussion on human bureaucracy above, with so many actors taking this on, and with the advantages of even a single AGI (across clusters of agents) being so significant, my guess is that the desire to be first will obviate any thoughts of putting in safeguards.

Other considerations for super AGI takeoff

Once you have one AGI trained, why wouldn’t some organization, company or country deploy multiple agents? Moreover, inferencing takes orders of magnitude less computational power than training. So with 1/100th to 1/1000th of the training infrastructure, one could run a single AGI. But the real question is, wouldn’t 100 or 1000 AGIs represent super intelligence?

Yes and no. 100 humans don’t represent super intelligence and 1000 even less so. But humans have other desires, and it’s unclear that 100 humans super focused on one task wouldn’t represent super intelligence (on that task).


What can be done to slow AGI takeoff today

Barring something on the order of Nuclear Proliferation treaties/protocols, putting all GPUs/TPUs/IPUs on weapons export limitations AND restricting as secret any and all AI research, nothing easily comes to mind. Of course Nuclear Proliferation isn’t looking that good at the moment, but whatever its current state, it has delayed proliferation over time.

One could spend time and effort slowing technology progress down, such as by reducing next generation CPU/GPU/IPU compute cores, limiting compute speedups, reducing funding for AI research, imposing a compute tax, etc. All of which, if done across the technological landscape and the whole world, could give humanity more time to build in AGI safeguards. But doing so would adversely impact all technological advancement, in healthcare, business, government, etc. And given the proliferation of current technology and the state actors working on increasing capabilities to create more, it would be hard to envision slowing technological advancement down much, if at all.

It’s almost like putting a tax on slide rules or making their granularity larger.

It could be that super AGI would independently perceive itself benignly, and only provide benefit to humanity and the earth. But, my guess is that given the number of bad actors intent on controlling the world, even if this were true, they would try to (re-)direct it to harm segments of humanity/society. And once unleashed, it would be hard to stop.

The only real solution to AGI in bad actor hands is to educate all of humanity to value all humans and to cherish the environment we all live in as sacred. This would eliminate bad actors.

It sounds so naive, but in reality, it’s the only way, I believe, that we can truly hope to get through this AGI technological existential crisis.

Just like nuclear, we as a society will keep running into technological existential crises like this. A better, more all inclusive, more all embracing, and less combative humanity could help head all of them off.



The Hollowing out of enterprise IT

We had a relatively long discussion yesterday, amongst a bunch of independent analysts and one topic that came up was my thesis that enterprise IT is being hollowed out by two forces pulling in opposite directions on their apps. Those forces are the cloud and the edge.

Western part of the abandoned Packard Automotive Plant in Detroit, Michigan. by Albert Duce

Cloud sirens

The siren call of the cloud for business units, developers and modern apps has been present for a long time now. And its call is more pervasive than anything Odysseus ever had to deal with.

The cloud’s allure is primarily low-cost, instant infrastructure that just works, a software solution/tool box that’s overflowing, locations close to most major metropolitan areas, and the extreme ease of starting up.

If your app ever hopes to scale to meet customer demand, where else can you go? If your data can literally come in from anywhere, it usually lands on the cloud. And if you have need for modern solutions, tools, frameworks or just about anything the software world can create, there’s nowhere else with more of this than the cloud.

Pre-cloud, all those apps would have run in the enterprise or wouldn’t have run at all. And all that data would have been funneled back into the enterprise.

Not today. The cloud has it all, and its siren call is getting louder every day, ever ready to satisfy every IT desire anyone could possibly have, except for the edge.

The Edge, last bastion for onsite infrastructure

The edge emerged over the last decade or so, more or less in stealth mode. Yes, there were always pockets of edge with unique compute or storage needs. For example, video surveillance has been around forever. But the real acceleration of edge deployments came as compute and storage prices dropped drastically.

These days, the data being generated is staggering and the compute requirements that go along with all that data are all over the place, from a few ARM/RISC-V cores to a server farm.

For instance, CERN’s LHC creates a PB of data every second of operation (see IEEE Spectrum article, ML shaking up particle physics too). But they don’t store all that. So they use extensive compute (and ML) to try to only store interesting events.

Seismic ships roam the seas taking images of underground structures, generating gobs of data, some of which is processed on ship and the rest elsewhere. A friend of mine creates RPi enabled devices that measure tank liquid levels deployed in the field.

More recently, smart cars are like data centers on tires, rolling across roads around the world and generating more data than you can even imagine. 5G towers are data centers on top of buildings, in farmland, and in cell towers dotting the highways of today. All off the beaten path, and all places where no data center has ever gone before.

In olden days there would have been much less processing done at the edge and more in an enterprise data center. But nowadays, with the advent of relatively cheap computing and storage, data can be pre-processed, compressed, tagged all done at the edge, and then sent elsewhere for further processing (mostly done in the cloud of course).

IT Vendors at the crossroads

And what does the hollowing out of enterprise data centers mean for IT server and storage vendors? Mostly, danger lies ahead. Enterprise IT hardware spend will stop growing, if it hasn’t already, and over time, shrink dramatically. It may be hard to see this today, but it’s only a matter of time.

Certainly, all these vendors can become more cloud like, on prem, offering compute and storage as a service, with various payment options to make it easier to consume. And for storage vendors, they can take advantage of their installed base by providing software versions of their systems running in the cloud that allows for easier migration and onboarding to the cloud. The server vendors have no such option. I see all the above as more of a defensive, delaying or holding action.

This is not to say that enterprise data centers will go away. Just like mainframe and tape before them, on prem data centers will exist forever, but will be relegated to smaller and smaller niche markets that won’t grow anymore. And only as long as vendor(s) continue to upgrade technology AND there’s profit to be made.

It’s just that the astronomical growth that’s been happening ever since the middle of last century won’t happen in enterprise hardware anymore.

Long term life for enterprise vendors will be hard(er)

Over the long haul, some server vendors may be able to pivot to the edge. But the diversity of compute hardware there will make it difficult to generate enough volumes to make a decent profit there. However, it’s not to say that there will be 0 profits there, just less. So, when I see a Dell or HPE server, under the hood of my next smart car or inside the guts of my next drone, then and only then, will I see a path forward (or sustained revenue growth) for these guys.

For enterprise storage vendors, future prospects look bleak in comparison. Despite the data generation and growth at the edge, I don’t see much of a role for them there. The enterprise class features and functionality they have spent decades creating and nurturing aren’t valued as much in the cloud, nor are they presently needed at the edge. Maybe I’m missing something here, but I just don’t see a long term play for them in the cloud or edge.


For the record, all this is conjecture on my part. But I have always believed that if you follow where new apps are being created, there you will find a market ready to explode. And where the apps are no longer being created, there you will see a market in the throes of a slow death.


Safe AI

I’ve been writing about AGI (see part-0 [ish], part-1 [ish], part-2 [ish], part-3 [ish], part-4 and part-5) and the dangers that come with it (part-0 in the above list) for a number of years now. After my last post on the subject, I expected to be writing a post discussing the book Human Compatible: AI and the Problem of Control, which is a great book on the subject. But since then I ran across a paper that is perhaps a better brief introduction to the topic and to some of the current thought and research into developing safe AI.

The article I found is Concrete problems in AI safety, written by a number of researchers at Google, Stanford, Berkeley, and OpenAI. It essentially lays out the AI safety problem in 5 dimensions and these are:

Avoiding negative side effects – these can be minor or major and are probably the one thing that scares humans the most, e.g. some toothpick generating AI that strips the world to maximize toothpick making.

Avoiding reward hacking – this is more subtle, but essentially it’s having your AI fool you into thinking it’s doing what you want while it’s doing something else. This could range from actually changing the reward logic itself to convincing/manipulating the human overseer into seeing things its way. Also a pretty bad thing from humanity’s perspective.

Scalable oversight – this is the problem where human overseers aren’t able to keep up and witness/validate what some AI is doing, 7×24, across the world, at the speed of electronics. So how can AI be monitored properly so that it doesn’t go and do something it’s not supposed to (see the prior two for ideas on how bad this could be)?

Safe exploration – this is the idea that reinforcement learning, in order to work properly, has to occasionally explore a solution space, e.g. a Go board with moves selected at random, to see if they are better than what it currently believes is the best move to make. This isn’t much of a problem for game playing ML/AI, but if we are talking about helicopter controlling AI, exploration at random could destroy the vehicle plus any nearby structures, flora or fauna, including humans of course.

Robustness to distributional shifts – this is the perennial problem where AI or DNNs are trained on one dataset but over time the real world changes and the data they now see has shifted (in distribution) to something else. This often leads to DNNs not operating properly over time or having many more errors in deployment than they did during training. This is probably the one problem in this list that is undergoing more research to try to rectify than any of the others, because it impacts just about every ML/AI solution currently deployed in the world today. This robustness to distributional shifts problem is why many AI DNN systems require periodic retraining.
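To give a feel for what watching for distributional shift can look like in practice, here is a deliberately simple sketch of my own (it is not from the paper). It compares the mean of one live feature against its training mean, measured in standard errors. `drift_score`, `needs_retraining` and the threshold of 3 are illustrative assumptions, and real monitoring would look at far more than a single mean.

```python
import statistics


def drift_score(train: list[float], live: list[float]) -> float:
    """How many standard errors the live mean sits from the training mean."""
    mu = statistics.fmean(train)
    sigma = statistics.stdev(train)
    standard_error = sigma / len(live) ** 0.5
    return abs(statistics.fmean(live) - mu) / standard_error


def needs_retraining(train: list[float], live: list[float],
                     threshold: float = 3.0) -> bool:
    # threshold is an arbitrary illustrative cutoff, not a recommended value
    return drift_score(train, live) > threshold


# Synthetic feature values: the "shifted" batch has drifted upward by 2.
train_values = [float(x % 10) for x in range(1000)]
live_same = train_values[:100]
live_shifted = [x + 2.0 for x in live_same]

print(needs_retraining(train_values, live_same))
print(needs_retraining(train_values, live_shifted))
```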

So now we know what to look for, now what

Each of these deserves probably a whole book or more to understand and try to address. The paper talks about all of these and points to some of the research or current directions trying to address them.

The researchers correctly point out that some of the above problems are more pressing when more complex ML/AI agents have more autonomous control over actions in the real world.

We don’t want our automotive automation driving us over a cliff just to see if it’s a better action than staying in the lane. But Go playing bots or article summarizers might be ok to be wrong occasionally if it could lead to better playing bots/more concise article summaries over time. And although exploration is mostly a problem during training, it’s not to say that such activities might not also occur during deployment to probe for distributional shifts or other issues.
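One common-sense way to picture safer exploration is epsilon-greedy action selection that only ever explores among actions a hand-written safety check allows. This is a minimal sketch under my own assumptions: `choose_action`, `is_safe` and the action names are made up, and a real system can’t rely on safety being this easy to write down.

```python
import random


def choose_action(q_values: dict[str, float], is_safe, epsilon: float,
                  rng: random.Random) -> str:
    """Epsilon-greedy selection restricted to actions that pass is_safe."""
    safe_actions = [a for a in q_values if is_safe(a)]
    if not safe_actions:
        raise ValueError("no safe action available")
    if rng.random() < epsilon:
        return rng.choice(safe_actions)            # explore, but only safely
    return max(safe_actions, key=q_values.get)     # exploit the best safe action


# The agent has (wrongly) learned a high value for a catastrophic action.
q_values = {"stay_in_lane": 1.0, "change_lane": 0.4, "drive_off_cliff": 5.0}


def is_safe(action: str) -> bool:
    return action != "drive_off_cliff"


rng = random.Random(0)
picks = [choose_action(q_values, is_safe, 0.5, rng) for _ in range(10)]
print(picks)
```

The filter applies to both the explore and exploit branches, so even a badly mis-estimated value for the unsafe action never gets acted on.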

However, as we start to see more complex ML/AI solutions controlling more activities, the issues of AI safety are starting to become more pressing. Autonomous cars are just one pressing example. But recent introductions of sorting robots, agricultural bots, manufacturing bots, nursing bots, guard bots, soldier bots, etc. are all just steps down a (short) path of increasing complexity that can only end in some AGI bots running more parts (or all) of the world.

So safety will become a major factor soon, if it’s not already.

Scares me the most

The first two on the list above scare me the most: avoiding negative or unintentional side effects, and avoiding reward hacking.

I suppose if we could master scalable oversight we could maybe deal with all of them better as well. But that’s defense. I’m all about offense and tackling the problem up front rather than trying to deal with it after it’s broken.

Negative side effects

Negative side effects is a rather nice way of stating the problem of having your ML destroy the world (or parts of it) that we need to live.

One approach to dealing with this problem is to define or train another AI/ML agent to measure impacts on the environment and have it somehow penalize the original AI/ML for causing them. The learning approach has some potential to be applied to numerous ML activities if it can be shown to be safe and fairly all encompassing.
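A toy version of an impact penalty might look like the sketch below. Here the impact measure is hand-written rather than learned, and `impact`, `shaped_reward`, the state dictionaries and the penalty weight are all illustrative assumptions of mine, using the toothpick example from earlier.

```python
def impact(before: dict[str, int], after: dict[str, int],
           task_keys: set[str]) -> int:
    """Count changes to parts of the environment the task didn't need to touch."""
    return sum(1 for key in before
               if key not in task_keys and before[key] != after.get(key))


def shaped_reward(task_reward: float, before: dict[str, int],
                  after: dict[str, int], task_keys: set[str],
                  penalty_weight: float = 30.0) -> float:
    # penalty_weight is an arbitrary illustrative value
    return task_reward - penalty_weight * impact(before, after, task_keys)


before = {"toothpicks": 0, "forest": 100, "river": 1}
careful_after = {"toothpicks": 10, "forest": 100, "river": 1}
greedy_after = {"toothpicks": 50, "forest": 0, "river": 0}
task_keys = {"toothpicks"}

careful = shaped_reward(10.0, before, careful_after, task_keys)
greedy = shaped_reward(50.0, before, greedy_after, task_keys)
print(careful, greedy)
```

The greedy agent makes five times as many toothpicks but ends up with the lower reward, because it changed two things it had no business changing.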

Another approach discussed in the paper is to inhibit or penalize the original ML actions for any actions which have negative consequences. One approach to this is to come up with an “empowerment measure” for the original AI/ML solution. The idea would be to reduce, minimize or govern the original ML’s action set (or potential consequences) or possible empowerment measure so as to minimize its ability to create negative side effects.

The paper discusses other approaches to the problem of negative side effects, one of which is having multiple ML (or ML and human) agents working on the problem it’s trying to solve together and having the ability to influence (kill switch) each other when they discover something’s awry. And the other approach they mention is to reduce the certainty of the reward signal used to train the ML solution. This would work by having some function that would reduce the reward if there are random side effects, which would tend to have the ML solution learn to avoid these.

Neither of these latter two seems as feasible as the others, but they are all worthy of research.

Reward hacking

This seems less of a problem to our world than negative side effects until you consider that if an ML agent is able to manipulate its reward code, it’s probably able to manipulate any code intending to limit potential impacts, penalize it for being more empowered or manipulate a human (or other agent) with its hand over the kill switch (or just turn off the kill switch).

So this problem could easily lead to a break out of any of the other problems present on the list of safety problems above and below. An example of reward hacking is a game playing bot that detects a situation that leads to a buffer overflow and results in a win signal or higher rewards. Such a bot will no doubt learn how to cause more buffer overflows so it can maximize its reward rather than learn to play the game better.

But the real problem is that a reward signal used to train an ML solution is just an approximation of what’s intended. Chess programs in the past were trained by masters to use their opening to open up the center of the board and use their middle and end game to achieve strategic advantages. But later chess and Go playing bots just learned to checkmate their opponent and let the rest of the game take care of itself.

Moreover, (board) game play is a relatively simple domain in which to come up with proper reward signals (with the possible exception of buffer overflows or other bugs). But for car driving bots, drone bots, guard bots, etc., reward signals are not nearly as easy to define or implement.

One approach to avoid reward hacking is to make the reward signaling process its own ML/AI agent that is (suitably) stronger than the ML/AI agent learning the task. Most reward generators are relatively simple code. For instance in Monopoly, one that just counts the money that each player has at the end of the game could be used to determine the winner (in a timed Monopoly game). But rather than having a simple piece of code create the reward signal, use ML to learn what the reward should be. Such an agent might be trained to check to see if more or less money was being counted than was physically possible in the game. Or if property was illegally obtained during the game or if other reward hacks were done. And penalize the ML solution for these actions. These would all make the reward signal depend on proper training of that ML solution. And the two ML solutions would effectively compete against one another.
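As a stand-in for the kind of check such a reward agent might learn, here is a hand-coded sketch (so not ML at all) that refuses to reward a Monopoly result containing more money than the game could hold. `checked_reward` is a hypothetical name, and the bank total is an assumed value used purely for illustration.

```python
BANK_TOTAL = 20_580  # assumed total cash in the game; illustrative only


def checked_reward(final_cash: dict[str, int], player: str,
                   bank_total: int = BANK_TOTAL) -> float:
    """Reward 1.0 for the richest player, but penalize impossible outcomes."""
    total = sum(final_cash.values())
    if total > bank_total or any(cash < 0 for cash in final_cash.values()):
        return -1.0  # more money than exists (or negative cash): likely a hack
    return 1.0 if final_cash[player] == max(final_cash.values()) else 0.0


honest_game = {"bot": 6_000, "human": 4_500}
hacked_game = {"bot": 9_999_999, "human": 4_500}

print(checked_reward(honest_game, "bot"))
print(checked_reward(hacked_game, "bot"))
```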

Another approach is to “sandbox” the reward code/solution so that it is outside of external and/or ML/AI influence. Possibly combining the prior approach with this one might suffice.

Yet another approach is to examine the ML solution’s future states (actions) to determine if any of them impact the reward function itself and penalize it for doing this. This assumes that the future states are representative of what it plans to do and that some code or some person can recognize states that are inappropriate.

Another approach discussed in the paper is to have multiple reward signals. These could use multiple formulas for computing the multi-faceted reward signal and averaging them or using some other mathematical function to combine them into something that might be more accurate than one reward function alone. This way any ML solution reward hacking would need to hack multiple reward functions (or perhaps the function that combines them) in order to succeed.
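To make the multiple reward signal idea a bit more concrete, here’s a minimal sketch. The game state, the three scoring functions and the use of a median as the combining function are all my own assumptions for illustration, not something taken from the paper.

```python
from statistics import mean, median


def combine_rewards(state, reward_fns, combiner=median):
    """Score one state with several independent reward functions
    and combine the scores into a single reward value."""
    scores = [fn(state) for fn in reward_fns]
    return combiner(scores)


# Hypothetical end-of-game state in which the agent has corrupted one scorer.
state = {"counted_cash": 1_000_000, "ledger_cash": 1_500, "bank_delta": 1_500}

# Three hypothetical ways of measuring the same thing (money won).
reward_fns = [
    lambda s: s["counted_cash"],  # naive cash count (the hacked signal)
    lambda s: s["ledger_cash"],   # cash according to a transaction ledger
    lambda s: s["bank_delta"],    # money that actually left the bank
]

robust = combine_rewards(state, reward_fns)       # median of the three scores
naive = combine_rewards(state, reward_fns, mean)  # plain average of the three
```

With a plain average the one hacked signal still inflates the reward, while the median ignores it unless a majority of the reward functions are hacked. Which combining function is appropriate would depend on the task.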

The one IMHO that has the most potential but which seems the hardest to implement is to somehow create “variable indifference” in the ML/AI solution. This means having the ML/AI solution ignore any steps that impact the reward function itself or other steps that lead to reward hacking. The researchers rightfully state that if this were possible then many of the AI safety concerns could be dealt with.

There are many other approaches discussed and I would suggest reading the paper to learn more. None of the others seems simple or a complete solution to all potential reward hacks.


The paper goes into the same or more level of detail with the other three “concrete safety” issues in AI.

In my last post (see part 5 link above) I thought I was going to write about the discussion of AI safety in S. Russell’s book, Human Compatible (AI). But then I found the “Concrete problems in AI safety” paper (see link above), thought it provided a better summary of AI safety issues and used it instead. I’ll try to circle back to the book at some later date.


AI navigation goes with the flow

Read an article the other day (Engineers Teach AI to Navigate Ocean with Minimal Energy) about a simulated robot that was trained to navigate 2D turbulent water flow to travel between locations. They used a combination of reinforcement learning and a DNN derived policy. The article was reporting on a Nature Communications open access paper (Learning efficient navigation in vortical flow fields).

The team was attempting to create an autonomous probe that could navigate the ocean and other large bodies of water to gather information. I believe ultimately the intent was to provide the navigational smarts for a submersible that could navigate terrestrial and non-terrestrial oceans.

One of the biggest challenges for probes like this is to be able to navigate turbulent flow without needing a lot of propulsive power or a lot of computational power. They said that any probe that could propel itself faster than the current could easily travel wherever it wanted but the real problem was to go somewhere with lower powered submersibles. As a result, they set their probe to swim at a constant speed of 80% of the overall simulated water flow.

Even that would be relatively feasible if you had unlimited computational power to train and inference with, but trying to do this on something that could fit in a small submersible was a significant challenge. NLP models today have millions of parameters and take hours to train with multiple GPU/CPU cores in operation and lots of memory. Inferencing using these NLP models also takes a lot of processing power.

The researchers targeted something with significantly less computational power and wished to train and perform real time inferencing on the same hardware. They chose a “Teensy 4.0 micro-controller” board for their computational engine, which costs under $20, has ~2MB of flash memory and fits in a space smaller than 1.5″x1.0″ (38.1mm X 25.4mm).

The simulation setup

The team started their probe’s turbulent flow training with a cylinder in a constant flow that generated downstream vortices, rotating in opposite directions. These vortices would travel from left to right in the simulated flow field. In order for the navigation logic to have to traverse this vortical flow, they randomly selected start and end locations on different sides of it.

The AI model they trained and used for inferencing was a combination of reinforcement learning (with an interesting multi-factor reward signal) and a policy using a trained deep neural network. They called this approach Deep RL.

For reinforcement learning, they used a reward signal that was a function of three variables: the time it took, the difference in distance to target and a success bonus if the probe reached the target. The time variable was a penalty and was the duration of the swim activity. Distance to target was how much the Euclidean distance between the current probe location and the target location had changed over time. The bonus was only applied when the probe was in close proximity to the target location. The researchers indicated the reward signal could be used to optimize for other values such as energy to complete the trip, surface area traversed, wear and tear on propellers, etc.
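As a rough illustration of a three part reward like the one described above, here’s a minimal per-step sketch. The weights, the success radius and the function name are my assumptions; the paper’s actual constants and formulation may well differ.

```python
import math


def step_reward(prev_pos, pos, target, dt,
                time_penalty=1.0, progress_weight=10.0,
                bonus=200.0, success_radius=0.5):
    """Reward for one swim step: time penalty + progress toward target
    + a bonus when the probe ends the step close to the target.
    All default constants are illustrative assumptions."""
    prev_dist = math.dist(prev_pos, target)  # Euclidean distance before the step
    dist = math.dist(pos, target)            # Euclidean distance after the step
    reward = -time_penalty * dt                     # every time step costs something
    reward += progress_weight * (prev_dist - dist)  # positive if the probe got closer
    if dist <= success_radius:                      # bonus only near the target
        reward += bonus
    return reward


# A step that closes 3 units of distance but is still far from the target.
far_step = step_reward((0.0, 0.0), (3.0, 0.0), (10.0, 0.0), dt=1.0)

# A step that ends inside the success radius and so earns the bonus.
final_step = step_reward((9.0, 0.0), (9.8, 0.0), (10.0, 0.0), dt=1.0)
```

Swapping the time penalty for an energy or propeller wear term would be one way to optimize for the other values the researchers mention.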

For the reinforcement learning state information, they supplied the probe’s location relative to the target [Difference(Probe x,y, Target x,y)] and whatever sensor data was being tested (e.g., for the velocity sensor equipped probe, the local velocity of the water at the probe’s location).

They trained the DNN policy using the state information (probe start and end location, local velocity/vorticity sensor data) to predict the swim angle used to navigate to the target. The DNN policy used 2 internal layers with 64 nodes each.
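To get a feel for how small such a policy is, here’s a minimal, untrained sketch of a network with 2 hidden layers of 64 nodes each. The 4 value input (relative x,y to target plus local x,y velocity), the tanh activations, the output scaling to an angle and the random weights are all my assumptions, and it’s written in plain Python rather than whatever the researchers actually used.

```python
import math
import random


def make_layer(n_in, n_out, rng):
    """Randomly initialized dense layer: (weight rows, biases)."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


def dense(x, layer, activation=None):
    """Apply one dense layer to the input vector x."""
    weights, biases = layer
    out = [sum(w * xi for w, xi in zip(row, x)) + b
           for row, b in zip(weights, biases)]
    return [activation(v) for v in out] if activation else out


def swim_angle(state, layers):
    """Map a state vector to a swim angle in [-pi, pi] radians."""
    hidden = dense(state, layers[0], math.tanh)
    hidden = dense(hidden, layers[1], math.tanh)
    (raw,) = dense(hidden, layers[2])
    return math.pi * math.tanh(raw)


def count_parameters(layers):
    """Total number of weights and biases in the network."""
    return sum(len(w) * len(w[0]) + len(b) for w, b in layers)


rng = random.Random(0)
layers = [make_layer(4, 64, rng), make_layer(64, 64, rng), make_layer(64, 1, rng)]

# Hypothetical state: target is 2.0 right and 1.0 below, local flow is (0.3, 0.1).
angle = swim_angle([2.0, -1.0, 0.3, 0.1], layers)
n_params = count_parameters(layers)
```

Under these assumptions the whole policy is only 4,545 parameters, which helps explain why both training and inferencing on a microcontroller class device is plausible.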

They benchmarked the Deep RL solution with local velocity sensing against a number of different approaches: a naive approach that always swam in the direction of the target, a flow blind approach that had no sensors but used feedback from its location changes to train with, a vorticity sensor approach which sensed the vorticity of the local water flow, and a complete knowledge approach (not shown above) that had information on the actual flow at every location in the 2D simulation.

It turned out that of the first four (naive, flow-blind, vorticity sensor and velocity sensor) the velocity sensor configured robot had the highest success rate (“near 100%”).

That simulated probe was then measured against the complete flow knowledge version. The complete knowledge version had faster trip speeds, but only 18-39% faster (on the examples shown in the paper). However, the knowledge required to implement this algorithm would not be feasible in a real ocean probe.

More to be done

They tried the probe’s Deep RL navigation algorithm on a different simulated flow configuration, a double gyre flow field (sort of like 2 circular flows side by side but going in opposite directions).

The previously trained (on cylinder vortical flow) Deep RL navigation algorithm only had a ~4% success rate with the double gyre flow. However, after training the Deep RL navigation algorithm on the double gyre flow, it was able to achieve an 87% success rate.

So with sufficient re-training it appears that the simulated probe’s navigation Deep RL could handle different types of 2D water flow.

The next question is how well their Deep RL can handle real 3D water flows, such as tidal flows, up-down swells, long term currents, surface wind-wave effects, etc. It’s probable that any navigation for real world flows would need to have a multitude of Deep RL trained algorithms to handle each and every flow encountered in real oceans.

However, the fact that training and inferencing could be done on the same small hardware indicates that the Deep RL could possibly be deployed in any flow: let it train on the local flow conditions until it succeeds, then let it loose until it starts failing again. Training each time would take a lot of propulsive power but may be suitable for some probes.

The researchers have 3D printed a submersible with a Teensy microcontroller and an Arduino controller board, with propellers surrounding it so it can swim in any 3D direction. They have also constructed a water tank for real life testing of their Deep RL navigation algorithms.


BEHAVIOR, an in-home robot, benchmark

As my readers probably already know, I’m a long time benchmark geek. So when I recently read an article out of Stanford (AI Experts Establish the “North Star” for Domestic Robotics Field) where a research team there developed a new robotic benchmark, I was interested. The new robotics benchmark is called BEHAVIOR which was documented in an article (see: BEHAVIOR: Benchmark for Everyday Household Activities in Virtual, Interactive, and ecOlogical enviRonments). It essentially uses real world data to identify domestic work activities that any robot would need to perform in a home.

The problems with robot benchmarks

The problems with benchmarks are multi-faceted:

  • How realistic are the workloads used to evaluate the systems being measured?
  • How accurate are the metrics used to rank and judge benchmark submissions?
  • How costly/complex is it to run a benchmark?
  • How are submissions audited and are they reproducible?
  • Where are benchmark results reported and are they public?

And of course robotics brings in its own issues that make benchmarking more difficult:

  • What sensors does the robot have to understand how to complete tasks?
  • What manipulators does the robot have to perform the tasks required of it?
  • Do the robots move in the environment and if so, how do the robots move?
  • Does the robot perform the task in the real world or in a simulated environment?

And of course, when using a simulated environment, how realistic is it?

BEHAVIOR with iGibson (see below) seems to answer many of these concerns for in-home robot benchmarking.


First, BEHAVIOR’s home making tasks were selected from the American Time Use Survey maintained by the US Bureau of Labor Statistics, which identifies tasks Americans perform in their homes. With BEHAVIOR 1.0 there are 100 tasks ranging from building a fruit basket to cleaning a toilet, and just about everything in between. I didn’t see any cooking or mixing drinks tasks but maybe those will be added.

Second, BEHAVIOR uses a predicate logic called BDDL (BEHAVIOR Domain Definition Language) to define the initial conditions for a task (such as the tables, chairs, books, etc. located in the room), where objects need to be placed, and the successful completion goals, or what task completion should look like.

BEHAVIOR uses 15 different rooms or scenes in their benchmark, such as a kitchen, garage, study, etc. Each of the 100 tasks is performed in a specific room.

BEHAVIOR incorporates 1217 different objects in 391 categories. Once initial conditions are defined for a task, BEHAVIOR essentially randomly selects different objects for the task and randomly locates them throughout the room.

In order to run the benchmark, one could conceivably create a real room, with all the objects and have them placed according to BEHAVIOR BDDL’s randomly assigned locations with a robot physically present in the room and have it perform the assigned task OR one could use a simulation engine and have the robot run the task in the simulation environment, with simulated room, objects and robot.

It appears as if BEHAVIOR could operate in any robotics simulation environment but has been currently implemented in Stanford’s open source robotics simulation engine called iGibson 2.0 (see: iGibson 2.0: Object-Centric Simulation for Robot Learning of Everyday Household Tasks and iGibson 2.0 website). iGibson uses the Bullet real time physics engine for realistic physical environment simulation.

A robot operating within iGibson is provided a 3D rendering of the room and objects as images or LIDAR sensor scans. It can then identify the objects that it needs to manipulate to perform the tasks. One can define the robot’s simulated sensors and manipulators in iGibson 2.0. It’s written in Python, is open source (GitHub Repo) and can be installed to run on Linux (Ubuntu 16.04), Windows (10) or Mac (10.15) systems.

Finally, BEHAVIOR uses a set of metrics to determine how well a robot has performed its assigned task. Their first metric is a success score, defined as the fraction of goal conditions satisfied by the robot performing the task, such as the number of dishes properly cleaned and placed in the drying rack divided by the total number of dishes for a “washing dishes” task. Their second is a set of efficiency metrics, like time to complete a task, sum total of object distance moved during the task, how well objects are arranged at task completion (is the toilet seat down…), etc.
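The success score is simple enough to sketch in a few lines. This is just my illustration of the fraction described above; the function name and the dictionary of goal conditions are hypothetical and not BEHAVIOR’s or iGibson’s actual API.

```python
def success_score(goal_conditions):
    """Fraction of goal conditions satisfied at the end of a task.

    goal_conditions maps a goal description to True (satisfied)
    or False (not satisfied)."""
    if not goal_conditions:
        raise ValueError("a task needs at least one goal condition")
    satisfied = sum(1 for done in goal_conditions.values() if done)
    return satisfied / len(goal_conditions)


# Hypothetical outcome of a "washing dishes" task with four dishes.
washing_dishes = {
    "plate_1 clean and in rack": True,
    "plate_2 clean and in rack": True,
    "bowl_1 clean and in rack": True,
    "cup_1 clean and in rack": False,
}

score = success_score(washing_dishes)  # 3 of 4 goal conditions met
```

A score of 1.0 means every goal condition was met. The efficiency metrics would be tracked separately from this fraction.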

Another feature of iGibson 2.0 is that it offers the ability to record a human (in VR) doing a task in its simulated environment. So if your robotic system is able to learn by example, then iGibson could be used to provide training data for an activity.


A couple of additions to the BEHAVIOR benchmark/iGibson simulation environment that I would like to see:

  • There ought to be a way to construct a house/apartment where multiple rooms are arranged in a hierarchy, i.e., rooms associated with floors with connections using hallways, doors, stairs, etc. between them. This way one could conceivably define a set of homes/apartments (let’s say 5) that a robot would perform its tasks in.
  • They need a task list to drive robot activities. Assume that there’s some amount of time, let’s say 8-12 hours, that a robot is active and construct a series of tasks that need to be accomplished during that period.
  • Robots should be placed in the rooms/apartments/homes at random with random orientation and then they would have to navigate through rooms/passageways to the rooms to perform the tasks.
  • They need to add pet/human avatars in the rooms throughout a home. These would represent real time obstacles to task completion/navigation as well as add more tasks associated with caring for pets/humans.
  • They need the ability to add non-home rooms that could encompass factory floors, emergency response debris fields, grocery stores, etc. and their own unique set of tasks for each of these so that it could be used as a benchmark for more than just domestic robots.

Aside from the above additions to BEHAVIOR/iGibson 2.0, there’s the question of the organization that manages the benchmark and submissions. There needs to be a website/place to publish benchmark results for a robot AND a mechanism to audit results for accuracy to ensure fair play.

Typically this would be associated with an organization responsible for publishing and auditing submissions as well as guiding further development of BEHAVIOR/iGibson 2.0. BEHAVIOR 1.0 is not the end but it’s a great start at providing realistic tasks that any domestic robot would need to perform.

Benchmarks have always aided the development and assessment of new technologies. Having an in-home robot benchmark like BEHAVIOR makes getting domestic robots that do what we want them to do a more likely possibility someday.

There’s a new benchmark in town and it signals the dawning of the domestic robot age.


Dell EMC PowerStore X and the Edge – TFDxDell

This past summer I attended a virtual TFDxDell event where there were a number of sessions discussing Dell EMC technologies for the enterprise. One session, the Dell EMC PowerStore session, sort of struck a nerve, and I have finally figured out what interested me most in their talk: their PowerStore X appliances and AppsON technologies.

What is AppsON and PowerStore X appliance?

Essentially PowerStore X with AppsON has an onboard ESXi hypervisor which allows customers to run vSphere VMs inside the storage system with direct vVol (I assume) access to PowerStore data storage without having to go out over a (storage) network.

PowerStore X ESXi is a little behind the most recent VMware vSphere releases (at least 30 days) but it’s current enough for most shops. In non-PowerStore X appliances, PowerStoreOS runs as containers but in PowerStore X, PowerStoreOS storage functionality runs as VMs, just like any other VMs running on its ESXi hypervisor.

Moreover, PowerStore X can still service IOs from other non-PowerStore X resident VMs or bare metal applications running in the environment. In this way you get all the data services of an enterprise class storage system that also runs VMs.

With PowerStore OS 2.0 they have added scale out to AppsON. That is, any PowerStore X (1000X, 3000X, 5000X or 7000X) appliance in a PowerStore X cluster can have its VMs moved from one appliance to another using vSphere vMotion. This means that as your PowerStore X storage clusters grow, you can rebalance VM application workloads across the cluster. A PowerStore X cluster can contain up to 4 PowerStore X appliances.

PowerStore’s heritage goes back quite a ways at Dell and EMC. Prior versions of EMC Unity storage and some of its progenitors had the ability to run applications on the storage itself. But by running an ESXi hypervisor on PowerStore X appliances, it takes all this to a whole new level.

Why would anyone want AppsON?

It’s taken me some time to understand why anyone would want to use AppsON and I have concluded that the edge might be the best environment to deploy it.

Recent VMware enhancements have reduced minimum node configurations for edge environments to 2 servers. It’s unclear to me whether a single PowerStore X appliance with AppsON is one server or two but, for the moment, let’s assume it’s just one. This means that a minimum VMware vSphere edge deployment could use 1 PowerStore X and 1 standalone ESXi server.

In such an environment, customers could run their data intensive VMs directly on the PowerStore X and some of their non-data intensive VMs on the standalone server. But the flexibility exists to vMotion VMs from one to the other as demand dictates.

But does the edge need storage?

Yes, some do. For instance, take 5G. It enables a whole new class of mobile services and many of them can be quite data intensive. 5G is being deployed around the world as mini-data centers in cell towers. It’s unclear whether these data centers run vSphere but I’m sure VMware is trying their hardest to make that happen. With vSphere running your 5G mini-data center, PowerStore X could make a smart addition.

Then there’s all the smart cars, which are creating TBs of sensor data every time they take to the road. You’re probably not going to have a PowerStore appliance in your smart car (at least anytime soon) but they just might have one at the local service station.

And maybe given all the smart devices in your home, smart cars, smart appliances, smart robots, etc., there’s going to be a whole lot of data generated from your smart home. Having something like PowerStore X in your smart home’s mini-data center would offer a place to hold all that data and to do some processing (compressing maybe) before sending it up to the cloud.


We have just two more questions for Dell EMC:

  1. Shouldn’t the base PowerStore appliance be called PowerStore K?
  2. Shouldn’t customers be allowed to run their own K8s container apps on their PowerStore K just as easily as running VMs in their PowerStore X?

Legal Disclosure: TechFieldDay and Dell provided gifts to all participants (including me) for the TFDxDell event.

Photo credit(s):

  • From Dell EMC slides presented at TFDxDell event
  • From Dell EMC slides presented at TFDxDell event
  • From Dell EMC slides presented at TFDxDell event

Facebook’s (Meta) Kangaroo, a better cache for billions of small objects

Read an article this week in Blocks and Files, Facebook’s Kangaroo jumps over flash limitations, which piqued my interest. I went and searched for more info on this and found a fb blog post, Kangaroo: A new flash cache optimized for tiny objects, which sent me to an ACM SOSP (Symposium on Operating Systems Principles) best paper of 2021, Caching billions of tiny objects on flash.

First, as you may recall, flash has inherent limitations when it comes to writing. The more writes to a flash device, the more NAND cells start to fail over time. Flash devices are only rated for some number of (standard, ~4KB) block writes. For example, the Micron 5300 Max SSD only supports 3-5 DWPD (drive writes per day, in 4KB blocks). So, a 2TB Micron 5300 Max SSD can only sustain from ~1.5B to 2.4B 4KB block writes per day. Now that seems more than sufficient for most work, but somebody like fb, using the SSD as an object cache and writing a few billion or more 100B(yte) objects day in and day out, can consume an SSD in no time, especially if they are writing one 100B object per block.
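The arithmetic behind those numbers can be checked with a short sketch. I’m assuming decimal terabytes (10^12 bytes) and 4096 byte blocks here, and the function names are mine.

```python
def block_writes_per_day(capacity_bytes, dwpd, block_bytes=4096):
    """Number of block writes per day allowed by a DWPD rating."""
    return capacity_bytes * dwpd / block_bytes


def objects_per_block(object_bytes, block_bytes=4096):
    """How many whole objects fit in one flash block."""
    return block_bytes // object_bytes


capacity = 2 * 10**12  # 2TB drive, decimal terabytes (an assumption)

low = block_writes_per_day(capacity, dwpd=3)   # roughly 1.5 billion block writes/day
high = block_writes_per_day(capacity, dwpd=5)  # roughly 2.4 billion block writes/day

# Writing one 100 byte object per 4KB block wastes most of each write:
packed = objects_per_block(100)  # objects that could have shared the block
```

So packing objects into full blocks stretches the same daily write budget across roughly 40 times as many objects as writing one object per block.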

So there’s got to be a better way to cache small objects into bigger blocks. Their paper talks of two prior approaches:

  • Log structured storage – here multiple 100B objects are stored in a single 4KB block and written out with one IO rather than 40. This works fairly well but the index, which maps an object key to a log location, takes up a lot of memory space. If you’re caching ~3B 100B objects in a log and each object index entry takes 16 bytes, that’s 48GB of memory for the index.
  • Associative set storage – here each object is hashed into a set of (one or more) storage blocks and is stored there. In this case there’s no DRAM index but you do need a quick way to determine if an object is in the set storage or not. This can be done with bloom filters (see: wikipedia article on bloom filters). So if each associative set stores 400 objects and one needs to store 3B objects, one needs 30MB of bloom filters (assuming 4 bytes each). The only problem with associative sets is that when one adds an element to a set, the set has to be rewritten. So if over time you add 400 objects to a set you are writing that set 400 times. All of which eats into the DWPD budget for the flash storage.
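The memory and write numbers for the two approaches above work out as follows. This is a minimal sketch using the figures from the text; the function names and the ceiling rounding are my own choices.

```python
import math


def log_index_bytes(n_objects, bytes_per_entry=16):
    """DRAM needed for a log index with one entry per cached object."""
    return n_objects * bytes_per_entry


def bloom_filter_bytes(n_objects, objects_per_set=400, bytes_per_filter=4):
    """DRAM needed for one small bloom filter per associative set."""
    n_sets = math.ceil(n_objects / objects_per_set)
    return n_sets * bytes_per_filter


def log_block_writes(n_objects, objects_per_block=40):
    """Flash block writes when objects are batched into full log blocks."""
    return math.ceil(n_objects / objects_per_block)


n = 3 * 10**9  # ~3B cached objects

log_dram = log_index_bytes(n)     # 48 * 10**9 bytes, i.e. 48GB
set_dram = bloom_filter_bytes(n)  # 30 * 10**6 bytes, i.e. 30MB

# Adding 400 objects: a set is rewritten once per object added,
# while a log writes them out 40 at a time.
set_writes = 400
log_writes = log_block_writes(400)
```

That’s the trade-off in a nutshell: the log is cheap in flash writes but expensive in DRAM, and the associative sets are the other way around.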

In Kangaroo, fb engineers have combined the best of both of these together and added a small DRAM cache.

How does it work?

Their 1st tier is a DRAM cache, which is ~1% of the capacity of the whole object cache. Objects are inserted into the DRAM cache first and are evicted in a least recently used (LRU) fashion, that is, objects that have not been used for the longest time are moved out of this cache and are written to the next layer (not quite, but I’ll get to that in a moment).

Their 2nd tier is a log structured system, at ~5% of cache capacity. They call this a Klog and it consists of a ring of 4KB blocks on SSD, with a DRAM index telling where each object is located on the ring. Objects come in and are buffered together into a 4KB block and are written to the next empty slot in the ring with its DRAM index updated accordingly. Objects are evicted from Klog in such a way that a group of them, that would be located in the same associative set and are LRU, can all be evicted at the same time. They have structured the Klog DRAM index so that it makes finding all these objects easy. Also, any log structured system needs to deal with garbage collection. Let’s say you evict 5 objects in a 4KB block; that leaves 35 that are still good. Garbage collection will read a number of these partially full blocks and mash all the good objects together, leaving free space for new objects that need to be cached.

The 3rd and final tier is a set associative store, which they call the Kset, that uses bloom filters to show object presence. For this tier, an object’s key is hashed to find a block to put it in, the block is read, the object inserted and the block rewritten. Objects are evicted out of the set associative store based on LRU within a block. The bloom filters are used to determine if the object exists in a set associative block.

There are a few items missing from the above description. As can be seen in Figure 3B above, Kangaroo can jettison objects that are LRUed out of DRAM instead of adding them to the Klog. The paper suggests this can be done purely at random, say by admitting into the Klog only 95% of the objects being LRUed from DRAM, selected at random. The jettison threshold for Klog to Kset is different. Here they will jettison single object sets. That is, if there were only one object that would be evicted and written to a set, it’s jettisoned rather than saved in the Kset. The engineers call this a Kset threshold of 2 (indicating the minimum number of objects in a single set that can be moved to Kset).
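Here’s a minimal sketch of those two admission policies as I understand them. The function names, the use of a simple modulo as the set hash and the data structures are my assumptions, not the paper’s implementation.

```python
import random
from collections import defaultdict


def admit_to_klog(rng, probability=0.95):
    """Randomly admit an object evicted from DRAM into the Klog."""
    return rng.random() < probability


def split_by_threshold(evicted_keys, set_of, threshold=2):
    """Group Klog evictions by target set; only groups with at least
    `threshold` objects are written to the Kset, the rest are jettisoned."""
    groups = defaultdict(list)
    for key in evicted_keys:
        groups[set_of(key)].append(key)
    admitted = {s: keys for s, keys in groups.items() if len(keys) >= threshold}
    jettisoned = [k for keys in groups.values() if len(keys) < threshold
                  for k in keys]
    return admitted, jettisoned


# Stand-in hash: integer key modulo the number of sets (an assumption).
NUM_SETS = 4
admitted, jettisoned = split_by_threshold(
    [10, 14, 7, 22], set_of=lambda key: key % NUM_SETS)

# Roughly 95% of DRAM evictions should make it into the Klog.
rng = random.Random(0)
klog_admits = sum(admit_to_klog(rng) for _ in range(10_000))
```

In this example keys 10, 14 and 22 all map to set 2 and get written together in one set rewrite, while key 7 is alone in set 3 and is dropped. That’s how the threshold amortizes each set rewrite over more than one object.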

While understanding an object’s LRU status is fairly easy if you have a DRAM index element for each object, it’s much harder when there’s no individual object index available, as in Kset.

To deal with tracking LRU in the Kset, fb engineers created a RRIParoo index with a DRAM index portion and a flash resident index portion.

  • RRIParoo’s DRAM index is effectively a 40 bit bitmap which contains one bit per object, corresponding to its location in the block. A bit that is on in this DRAM bitmap indicates that the corresponding object has been referenced since the last time the flash resident index was re-written.
  • RRIParoo’s flash resident index contains 3 bit integers, each one corresponding to an object in the block. This integer represents how many clock ticks it has been since the corresponding object was last referenced. When the need arises to add an object to a full block, the object clock counters in that block’s RRIP flash index are all incremented until one reaches the oldest time frame, b’111′ or 7. It is this object that is evicted.

New objects are given an arbitrary clock tick count, say b’001′ or 1 (as shown in Fig. 6; in the paper they use b’110′ or 6), which is not so high that they are evicted right away but not so low that they are considered highly referenced.
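Here’s a minimal sketch of that eviction step for a single full block, as I read it. Resetting a referenced object’s counter to 0 when the DRAM bits are folded in is my assumption, as are the function and variable names. The sketch also adds the whole aging amount in one go rather than ticking counters one at a time, which gives the same result.

```python
MAX_AGE = 0b111  # oldest 3 bit clock value; objects at this age get evicted


def evict_and_insert(ages, referenced, new_key, new_age=0b001):
    """Make room in a full block and insert new_key.

    ages:       dict of key -> 3 bit clock counter (the flash resident index)
    referenced: set of keys whose DRAM reference bit is on
    Returns the evicted key."""
    # Fold the DRAM reference bits into the flash counters, then clear them.
    for key in referenced:
        if key in ages:
            ages[key] = 0  # assumption: a referenced object becomes "young"
    referenced.clear()

    # Age every object by the same amount until the oldest one hits MAX_AGE.
    bump = MAX_AGE - max(ages.values())
    for key in ages:
        ages[key] += bump

    # Evict the first object found at MAX_AGE and insert the new one.
    victim = next(key for key, age in ages.items() if age == MAX_AGE)
    del ages[victim]
    ages[new_key] = new_age
    return victim


# Hypothetical block holding three objects; "b" was referenced recently.
block_ages = {"a": 3, "b": 5, "c": 2}
dram_bits = {"b"}
evicted = evict_and_insert(block_ages, dram_bits, "d")
```

Without the DRAM bit, "b" would have been the oldest object and the one evicted. With it, "b" is reset, "a" ages to 7 and gets evicted, and "d" comes in with the starting count of 1.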

How well does Kangaroo perform

According to the paper, using the same flash storage and DRAM, it can reduce cache miss ratio by 29% over set associative or log structured caches alone. They tested this by using simulations of real world activity on their fb social network trees.

The engineers did some sensitivity testing using various Kangaroo algorithm parameters to see how sensitive read miss rates were to Klog admission percentage, RRIParoo flash index element (clock tick counter) size, Klog capacity and Kset admission threshold.

Kangaroo performance read miss rate sensitivity to various algorithm parameters

Applications of the technology

Obviously this is great for Twitter and facebook/meta as both of these deal with vast volumes of small data objects. But databases, Kafka data streams, IoT data, etc. all deal with small blocks of data and can benefit from the better caching that Kangaroo offers.

Storage could also use something similar, only in this case a) the objects aren’t small and b) the cache is all in memory. DRAM indexes for storage caching, especially when we have TBs of DRAM cache, can still be significant, especially if an index element is kept for each block in cache. So the technique could be deployed for large storage caches as well.

Then again, similar techniques could be used to provide caching for multiple tiers of storage. Say DRAM cache, SSD Log cache and SSD associative set cache for data blocks with the blocks actually stored on large disks or QLC/PLC SSDs.

Photo credit(s):