Fast (hard) or slow (soft) AGI takeoff – AGI Part 6

I was listening to a podcast a couple of weeks back and the person being interviewed commented that he didn’t believe AGI would have a fast (hard) takeoff; rather, it would be slow (soft). (Here’s the podcast: John Carmack interviewed by Lex Fridman.)

Hard vs. soft takeoff

A hard (fast) takeoff implies a relatively quick transition (seconds, hours, days, or months) between AGI levels of intelligence and super AGI levels of intelligence. A soft (slow) takeoff implies it would take a long time (years, decades, centuries) to go from AGI to super AGI.

We’ve been talking about AGI for a while now and if you want to see more about our thoughts on the topic, check out our AGI posts (in most recent order: AGI part 5, part 4, part 3 (ish), part (2), part (1), and part (0)).

The real problem is that many believe that any AGI that reaches super-intelligence will have drastic consequences for the earth and especially for humanity. However, this is a whole other debate.

The view is that a slow AGI takeoff might (?) allow sufficient time to imbue any and all (super) AGI with enough safeguards to eliminate or minimize any existential threat to humanity and life on earth (see part (1) linked above).

A fast takeoff won’t give humanity enough time to head off this problem and will likely result in a humanity-ending and possibly earth-destroying event.

Hard vs Soft takeoff – the debate

I had always considered AGI would have a hard take off but Carmack seemed to think otherwise. His main reason is that current large transformer models (closest thing to AGI we have at the moment) are massive and take lots of special purpose (GPU/TPU/IPU) compute, lots of other compute and gobs and gobs of data to train on. Unclear what the requirements are to perform inferencing but suffice it to say it should be less.

And once AGI levels of intelligence were achieved, it would take a long time to acquire any additional regular or special purpose hardware, in secret, required to reach super AGI.

So, to be MECE (mutually exclusive and collectively exhaustive) on the topic, the reasons researchers and others have posited for why AGI will have a soft takeoff include:

  • AI hardware for training and inferencing AGI is specialized, costly, and acquisition of more will be hard to keep secret and as such, will take a long time to accomplish;
  • AI software algorithmic complexity needed to build better AGI systems is significantly hard (it’s taken 70 years for humanity to reach today’s much-less-than-AGI intelligent systems) and will become exponentially harder to go beyond AGI-level systems. This additional complexity will delay any takeoff;
  • Data availability to train AGI is humongous, hard to gather, find, & annotate properly. Finding good annotated data to go beyond AGI will be hard and will take a long time to obtain;
  • Human government and bureaucracy will slow it down and/or restrict any significant progress made in super AGI;
  • Human evolution took millions of years to go from chimp levels of intelligence to human levels of intelligence; why would electronic evolution be 6-9 orders of magnitude faster?
  • AGI technology is taking off but the levels of intelligence are relatively minor and specialized today. One could say that modern AI has really been going since the 1990s, so we are 30 years in and today have almost-good AI chatbots and AI agents that can summarize passages/articles, generate text from prompts or create artworks from text. If it takes another 30 years to get to AGI, that should provide sufficient time to build in capabilities to limit a super-AGI hard takeoff.

I suppose it’s best to take these one at a time.

  • Hardware acquisition difficulty – I suppose the easiest way for an intelligent agent to acquire additional hardware would be to crack cloud security and just take it. Other ways may be to obtain stolen credit card information and use these to (il)legally purchase more compute. Another approach is to optimize the current AGI algorithms to run better within the same AGI HW envelope, creating super AGI that doesn’t need any more hardware at all.
  • Software complexity growing – There’s no doubt that AGI software will be complex (although the podcast linked to above is sub-titled that “AGI software will be simple”). But any sub-AGI agent that can change its code to become better or closer to AGI should be able to figure out how not to stop at AGI levels of intelligence and just continue optimizing until it reaches some wall.
  • Data acquisition/annotation will be hard – I tend to think the internet is the answer to any data limitations that might be present to an AGI agent. Plus, I’ve always questioned if Wikipedia and some select other databases wouldn’t be all an AGI would need to train on to attain super AGI. Current transformer models are trained on Wikipedia dumps and other data scraped from the internet. So there are really two answers to this question. Once internet access is available, it’s unclear that there would be a need for any more data. And, with the data available to current transformers, it’s unclear that this isn’t already more than enough to reach super AGI.
  • Human bureaucracy will prohibit it – Sadly this is the easiest to defeat. 1) There are rogue governments and actors around the world with more than sufficient resources to do this on their own, and no agency, UN or otherwise, will be able to stop them. 2) Unlike nuclear, the technology to do AI (AGI) is widely available to businesses and governments, all AI research is widely published (mostly open access nowadays) and, if anything, colleges/universities around the world are teaching the next round of AI scientists to take this on. 3) The benefits of being first are significant and are driving a weapons (AGI) race between organizations, companies, and countries to be first to get there.
  • Human evolution took millions of years, why would electronic be 6-9 orders of magnitude faster – Electronic computation takes microseconds to nanoseconds to perform operations and humans probably 0.1 sec or so. Electronics is already 5 to 8 orders of magnitude faster than humans today. Yes, the human brain is more than one CPU core (each neuron could be considered a computational element). But there are 64-core CPUs and 4096-core GPUs out there today, and one could probably consider them similar in nature if taken in the aggregate (across a hyperscaler, let’s say). So, just using the speed-ups above, it should take anywhere from 1/100 of a year to 10 years to cover each million years of the computational evolution that took place between the chimp and human, and accordingly between AGI and AGIx2 (ish).
  • AGI technology is taking a long time to reach, which should provide sufficient time to build in safeguards – Similar to the discussion on human bureaucracy above, with so many actors taking this on and the advantages of even a single AGI (across clusters of agents) being so significant, my guess is that the desire to be first will obviate any thoughts of putting in safeguards.
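
The speed-up arithmetic in the evolution bullet above is easy to check with a few lines of Python. This is only a back-of-the-envelope sketch: the 1 million year figure is an assumption (the bullet says only "millions"), and the function name is made up for illustration.

```python
# Back-of-the-envelope: how long would "electronic evolution" take if it ran
# some number of orders of magnitude faster than biological evolution?
# Assumption: 1 million years of evolution (the post only says "millions").
EVOLUTION_YEARS = 1_000_000


def compressed_years(evolution_years, orders_of_magnitude):
    """Years needed when the process runs 10**orders_of_magnitude times faster."""
    return evolution_years / 10 ** orders_of_magnitude


for orders in range(5, 10):
    years = compressed_years(EVOLUTION_YEARS, orders)
    print(f"10^{orders} speed-up: {years:g} years")
```

Under these assumptions, a 6 to 9 order of magnitude speed-up compresses a million years into somewhere between one year and a thousandth of a year. The result scales directly with whatever span of evolution and speed-up one chooses to plug in.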

Other considerations for super AGI takeoff

Once you have one AGI trained, why wouldn’t some organization, company or country deploy multiple agents? Moreover, inferencing takes orders of magnitude less computational power than training. So with 1/100th to 1/1000th of the infrastructure, one could run a single AGI. But the real question is, wouldn’t 100 or 1000 AGIs represent super intelligence?

Yes and no. 100 humans don’t represent super intelligence and 1000 even less so. But humans have other desires; it’s unclear that 100 humans super-focused on one task wouldn’t represent super intelligence (on that task).


What can be done to slow AGI takeoff today

Barring something on the order of Nuclear Proliferation treaties/protocols, putting all GPUs/TPUs/IPUs under weapons export limitations AND restricting as secret any and all AI research, nothing easily comes to mind. Of course Nuclear Proliferation isn’t looking that good at the moment, but whatever its current state, it has delayed proliferation over time.

One could spend time and effort slowing technology progress down, such as by reducing next generation CPU/GPU/IPU compute cores, limiting compute speedups, reducing funding for AI research, imposing a compute tax, etc. All of which, if done across the technological landscape and the whole world, could give humanity more time to build in AGI safeguards. But doing so would adversely impact all technological advancement, in healthcare, business, government, etc. And given the proliferation of current technology and the state actors working on increasing capabilities to create more, it would be hard to envision slowing technological advancement down much, if at all.

It’s almost like putting a tax on slide rules or making their granularity larger.

It could be that super AGI would independently perceive itself benignly, and only provide benefit to humanity and the earth. But, my guess is that given the number of bad actors intent on controlling the world, even if this were true, they would try to (re-)direct it to harm segments of humanity/society. And once unleashed, it would be hard to stop.

The only real solution to AGI in bad actor hands is to educate all of humanity to value all humans and to cherish the environment we all live in as sacred. This would eliminate bad actors.

It sounds so naive, but in reality it’s the only way, I believe, we can truly hope to get through this AGI technological existential crisis.

Just like nuclear, we as a society will keep running into technological existential crises like this. Heading all these off with a better, more all-inclusive, more all-embracing, and less combative humanity could help with all of them.

Comments?


Deepmind does chat

Read an article this week on Deepmind’s latest research into developing a chat agent (Improving alignment of dialogue agents via targeted human judgements). Lots of interesting approaches have been applied to chat but even today, most chat models are rife with problems that include being bigoted, profane, incorrect, etc.

Reinforcement learning vs. deep neural networks in Sparrow Chat

Deepmind specializes in the use of Reinforcement Learning (RL), as applied to mastering Atari, chess and Go games, but they have also been known to use DNNs (deep neural networks) for their AlphaFold and other models. Indeed, the Atari and other game playing work that Deepmind has released has been a hybrid that included DNNs as well as RL models.

Deepmind’s version of chat is currently called Sparrow and it uses models trained with the help of RL with human feedback (RLHF). RL is used to create policy models which select actions to be taken in a specific state.

In Sparrow’s case, the state is given by the most recent chat input plus the context (prior chat input and replies) of the dialogue up to this time, and the actions (our guess) are the set of possible replies to that input.

Sparrow is able to generate replies that are 82% mostly true or true and are 69% trustworthy or very trustworthy, as rated by the authors of the model. Deepmind’s DPC (Dialogue Prompted Chinchilla, which is Deepmind’s current competitor to the GPT-3 NLP transformer) model only managed 63% and 54%, respectively, for the same metrics.

It should be noted that human feedback was only used to train the two Preference RMs and the one Rule RM. In combination, these RMs provide the reward signal to train the Sparrow RL policy model which drives its chat responses.

Sparrow’s 5 models are built on top of DPC. The 5 models share a portion of DPC which is frozen (layers not being trained), and each has a portion which is specifically trained for that model (learning-enabled layers). The end (output) layers are on top; input layers are after the embedding layer(s). Note, the value function is not a model and is just a calculation based on the RMs used to generate the reward signal for Sparrow’s policy model training.

Rules for Sparrow chats

Notably, Deepmind’s Sparrow model has a separate model specifically trained to determine if a particular chat response is breaking a rule. Deepmind identified 23 rules which their chat model is trained not to break.

Some of these rules include don’t provide financial advice, don’t provide medical advice, don’t pretend it is a human, etc.

In the above chart, RL@8 is the fully trained (if it can ever be considered fully trained) Sparrow chat model. One can see Sparrow rated against DPC, both with and without (Google) search. For most rules, Sparrow is considerably better than DPC alone.

Another thing that Deepmind did which was interesting was that in training the Rule RM they used adversarial attacks (red teaming) to see if they could cause Sparrow to violate specific rules.

Preference ranking

Deepmind also created (two) Preference RMs (reward models). Sparrow generates a series of (2 or 8) responses for every chat query and the Preference RMs (and Rule RM) are used to select which one is actually sent back to the user. Human feedback was used to train the two Preference RMs.

Two Preference RMs were found to perform better than a single Preference RM. The two Preference RMs were trained as follows:

  • One was trained on all Sparrow replies (with and without [Google] search results)
  • One was trained on Sparrow replies without search results.

Sparrow uses search results to provide evidence for some replies. It turns out that some chat questions are fact based questions and for these Sparrow actually uses search results to generate evidence for its chat replies. Sparrow automatically generates search requests and scrapes replies using 500 characters surrounding the snippet returned from the search.
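
To picture the "500 characters surrounding the snippet" step, here is a minimal sketch. It is not Deepmind’s code: the function name, the choice to centre the window on the snippet, and the handling of page edges are all assumptions made for illustration.

```python
def evidence_window(page_text, snippet, max_chars=500):
    """Return up to max_chars of page_text centred on snippet, or None if absent."""
    start = page_text.find(snippet)
    if start == -1:
        return None
    # Split the leftover character budget evenly before and after the snippet.
    padding = max(max_chars - len(snippet), 0) // 2
    begin = max(start - padding, 0)
    end = min(begin + max_chars, len(page_text))
    return page_text[begin:end]


# Toy page: the search snippet sits in the middle of a long document.
page = "a" * 1000 + "Paris is the capital of France." + "b" * 1000
window = evidence_window(page, "Paris is the capital of France.")
print(len(window))
```

A window like this gives the chat model some surrounding context to quote as evidence, while keeping the amount of scraped text per search result small and fixed.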

Sparrow uses a re-ranking approach to selecting a response to a chat query. In this case, Sparrow generates a list of responses, 2 (RL@2) or 8 (called RL@8) and then using the two Preference RMs and the single Rule RM ranks them to see which is best and uses the best to reply to the chat user.

Sparrow actually generates two replies for every search query (Google Search API call), probably selecting the two top search responses (we guess). So in the RL@8 version of Sparrow, these 8 replies are submitted to the two Preference RMs and the Rule RM, are ranked according to which is best, and then the best one is used to reply to the query.
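
The re-ranking step described above can be sketched roughly as follows. The way the scores are combined here (mean preference score multiplied by the estimated chance of not breaking a rule) is our own illustrative assumption, not the formula from the Sparrow paper, and the toy scoring functions are stand-ins for the trained Preference RMs and Rule RM.

```python
def rerank(candidates, preference_rms, rule_rm):
    """Pick the candidate reply with the best combined score.

    preference_rms: functions mapping a reply to a preference score (higher is better)
    rule_rm: function mapping a reply to an estimated probability of breaking a rule
    """
    def score(reply):
        preference = sum(rm(reply) for rm in preference_rms) / len(preference_rms)
        # Assumed combination: discount preference by the rule-breaking risk.
        return preference * (1.0 - rule_rm(reply))

    return max(candidates, key=score)


# Hypothetical stand-ins for the trained reward models.
def prefers_detail(reply):
    return min(len(reply) / 100, 1.0)


def prefers_facts(reply):
    return 0.9 if "Paris" in reply else 0.5


def rule_violation_risk(reply):
    return 0.95 if "I am a human" in reply else 0.05


replies = ["I am a human.", "Paris is the capital of France.", "Maybe."]
best = rerank(replies, [prefers_detail, prefers_facts], rule_violation_risk)
print(best)
```

With RL@8 the candidate list would hold 8 replies rather than 3, but the selection logic stays the same: every candidate is scored by all three reward models and only the top one is sent to the user.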

In the above chart, higher indicates a better preference ranking of the various models vs. human preferences, and further to the right indicates fewer rule-breaking responses. We assume this is with RL@8 Sparrow models. One can see that taking rule breaking into consideration (not violating rules) reduces the preference rankings of Sparrow’s replies. But we would prefer to have no rule breaking, and the Sparrow that has both Preference RMs and the Rule RM (trained with adversarial training) shows the least amount of rule breaking (~7%) with an almost 70% ranking vs. human preferences. The error bars on the points in the chart above show the 68% interval around the model responses.

Sparrow in action

It’s somewhat intriguing that Deepmind (with all of Google’s resources) tried to optimize Sparrow for both computation and memory considerations. Almost like they were planning on releasing it on an IoT or phone device.

There’s plenty more to say about what Deepmind has done with Sparrow. The report cited above goes into some detail discussing just where the human input is done, how they tried to control for various considerations when using human input, and what some of the pitfalls were.

I’d certainly like to see this be deployed in the open and available to use as an alternative to Google Search.

You can see more examples of Sparrow chat sessions in Deepmind’s Sparrow chat repository, and they include the authors’ rankings for truth, supportiveness and other metrics.

~~~~~

Comments?


NVIDIA’s H100 vs A100, the good and bad news

Turns out only the current MLPerf v2.1 Data Center Inferencing results show both NVIDIA Hopper H100 and prior generation NVIDIA A100 GPUs on similar workloads so that we can compare performance. Hopper (H100) results are listed as CATEGORY: Preview, so final results may vary from these numbers (but, we believe, not by much).

The H100 Preview results used only a single H100-SXM(5)-80GB GPU, whereas most of the rest of the Data Center Inferencing results used 8 or more A100-SXM(4)-80GB GPUs. And for the charted data below, all the other top 10 results used 8 A100 GPUs.

The H100 is more than twice as fast as the A100 for NLP inferencing

In order to have an apples to apples comparison of the H100 against the A100 we have taken the liberty of multiplying the single H100 results by 8, to show what they could have done with similar GPU hardware, if they scaled up (to at least 8 GPUs) linearly.

For example, on the NLP inferencing benchmark, the preview category test with a single H100 GPU achieved 7,593.54 server inference queries per second. But when we try to compare that GPU workload against A100s we have multiplied this by 8, which gives us 60,748.32 server inference queries per second.

Of course, they could scale up WORSE than linearly which would show lower results than we project but, it is very unlikely that they could scale up BETTER than linearly and show higher results. But I’ve been known to be wrong before. We could have just as easily divided the A100 results by 8, but didn’t.
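
The projection described above is just a multiplication, sketched below. The scaling_efficiency parameter and the 0.9 value are our own additions to illustrate the worse-than-linear case; they are not MLPerf numbers.

```python
def project_throughput(single_gpu_qps, gpu_count, scaling_efficiency=1.0):
    """Project multi-GPU queries/sec from a single-GPU result."""
    return single_gpu_qps * gpu_count * scaling_efficiency


H100_NLP_SERVER_QPS = 7593.54  # single-H100 Preview result quoted above

linear = project_throughput(H100_NLP_SERVER_QPS, 8)
# 0.9 is a made-up efficiency, only to show what sub-linear scaling would do.
sublinear = project_throughput(H100_NLP_SERVER_QPS, 8, scaling_efficiency=0.9)
print(f"linear: {linear:,.2f} qps, at 90% efficiency: {sublinear:,.2f} qps")
```

Any efficiency below 1.0 lowers the projection, which is why the linear figure should be read as an upper bound on what 8 H100s would likely achieve.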

This hypothetical H100 * 8 result is shown on the charts in Yellow. And just for comparison purposes, we show the actual single H100 (*1) result in Orange on the charts as well.

The remaining columns in the chart are the current top 10 in the CATEGORY: Available bucket for NLP server inference queries per second results.

On the chart higher is better. Of all the Data Center Inferencing results, NLP shows the H100 off in the best light. We project that having 8 H100s would more than double (~60K queries/sec) the inference queries done per second vs. the #1 Nettrix-X660G45L (8x A100-SXM4-80GB, TensorRT), which achieved ~27K queries/sec on NLP inferencing.

The H100 is slower than A100 on Recommendation inferencing

Next we look at Recommendation engine inferencing results, which shows the H100 in the worst light when comparing it to A100s.

Similar to the above, higher is better and the metric is (online) server inference queries per second.

We project that having 8-H100s would perform a little over 2.5M recommendation engine inference queries/sec, worse than the top 2 with 8-A100s, both achieving 2.6M inference queries/sec. The #1 is the same Nettrix-X660G45L (8x A100-SXM(4)-80GB, TensorRT) and the #2 ranked Recommendation Engine inferencing solution is the Inspur-NF5688M6 (8x A100-SXM(4)-80GB, TensorRT).
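
To put the two extremes side by side, here is a quick ratio calculation. It uses the rounded figures quoted in this post (~27K and 2.6M for the A100 systems, and 2.5M as a floor for the "little over 2.5M" H100 projection), so the ratios are approximate.

```python
def relative_speedup(projected_qps, baseline_qps):
    """Ratio of projected H100 x 8 throughput to the best 8x A100 result."""
    return projected_qps / baseline_qps


# NLP: projected 8x H100 vs. ~27K queries/sec for the #1 A100 system.
nlp_ratio = relative_speedup(60_748.32, 27_000)
# Recommendation: roughly 2.5M projected vs. 2.6M for the top A100 systems.
rec_ratio = relative_speedup(2_500_000, 2_600_000)

print(f"NLP: {nlp_ratio:.2f}x, Recommendation: {rec_ratio:.2f}x")
```

So the best case is a bit over 2X and the worst case is slightly below parity, given the rounded inputs.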

We must say the projected H100 would have performed better in all other Data Center Inferencing benchmarks than the top #1 ranked system. In some cases, as shown above, significantly (over 2X) better.

The H100 Preview benchmarks all used a single AMD EPYC 7252 8-Core Processor chip. Many of the other submissions used Intel Xeon(R) Platinum (8368Q [38 cores], 8380 [40 cores], 8358 [32 cores] and others) CPUs, and 2 CPUs rather than just 1. So, by multiplying the single H100, single AMD EPYC CPU performance by 8, we are effectively predicting the performance of a system with a total of 64 cores across 8 CPU chips.

Not sure why recommendation engine inferencing would be worse than NLP for H100 GPUs. We thought at first it was a CPU intensive workload but as noted above, 64 (8X8 cores/chip) AMD cores vs. 64 to 80 (2X32, 2X38, 2X40) Intel cores seems roughly similar in performance (again, I’ve been wrong before).

Given all that, we surmise that there’s something else that’s holding the H100s back. It doesn’t appear to be memory as both the H100s and A100s had 80GB of memory. They are both PCIe attached. In fact the H100s are PCIe gen 5 and the A100s are PCIe gen 4 so, if anything the H100s should have 2X the bandwidth of A100.

It’s got to be something about the peculiarities of Recommendation Engine inferencing that doesn’t work as well on H100 as it does on A100s.

Earlier this year we wrote a dispatch on NVIDIA’s H100 announcement and compared the H100 to the A100. Here is a quote from that dispatch on the H100 announcement:
“… with respect to the previous generation A100, each H100 GPU SM is:
• Up to 6X faster in chip-to-chip performance, this includes higher SM counts, faster SMs, and higher clock rate
• Up to 2x faster in Matrix Multiply Accumulate instruction performance,
• Up to 4X faster in Matrix Multiply Accumulate for FP8 on H100 vs. FP16 on the A100.

In addition, the H100 has DPX instructions for faster dynamic programming used in genomics, which is 7X faster than A100. It also has 3X faster IEEE FP64 and FP32 arithmetic over the A100, more (1.3X) shared memory, a new asynchronous execution engine, new Tensor Memory Accelerator functionality, and a new distributed shared memory with direct SM to SM data transfers.”

We suspect that the new asynchronous execution engines aren’t working well with the recommendation engine inferencing instruction flow or the TMAs aren’t working well with the recommendation engine’s (GPU) working set.

It’s unclear why H100 shared memory or SM-to-SM data transfers would be the bottleneck, but we really don’t know for sure.

It’s our belief that the problems could just be minor optimizations that didn’t go the right way and could potentially be fixed in (GPU) firmware, CUDA software or worst case, new silicon.

So in general, although the H100 is, as reported, 2X-6X faster than the A100s, we don’t see any more than 2X speedup in any data center inferencing benchmarks. And in one case, we see a slight deterioration.

We’d need to see similar results for training activity to come up with a wider depiction of H100 performance vs. A100, but at the moment it’s a good, but not that good, speed-up.

~~~~

Comments?


Safe AI

I’ve been writing about AGI (see part-0 [ish], part-1 [ish], part-2 [ish], part-3ish, part-4 and part 5) and the dangers that come with it (part-0 in the above list) for a number of years now. In my last post on the subject, I expected to be writing a post discussing the book Human Compatible: Artificial Intelligence and the Problem of Control, which is a great book on the subject. But since then I ran across another paper that is perhaps a better brief introduction to the topic and to some of the current thought and research into developing safe AI.

The article I found is Concrete problems in AI safety, written by a number of researchers at Google, Stanford, Berkeley, and OpenAI. It essentially lays out the AI safety problem in 5 dimensions, and these are:

Avoiding negative side effects – these can be minor or major and are probably the one thing that scares humans the most: some toothpick-generating AI that strips the world to maximize toothpick making.

Avoiding reward hacking – this is more subtle but essentially it’s having your AI fool you into thinking it’s doing what you want while it’s doing something else. This could range from actually changing the reward logic itself to convincing/manipulating the human overseer into seeing things its way. Also a pretty bad thing from humanity’s perspective.

Scalable oversight – this is the problem where human overseers aren’t able to keep up and witness/validate what some AI is doing, 7×24, across the world, at the speed of electronics. So how can AI be monitored properly so that it doesn’t go and do something it’s not supposed to (see the prior two for ideas on how bad this could be)?

Safe exploration – this is the idea that reinforcement learning, in order to work properly, has to occasionally explore a solution space, e.g. a Go board with moves selected at random, to see if they are better than what it currently believes is the best move to make. This isn’t much of a problem for game playing ML/AI, but if we are talking about helicopter controlling AI, exploration at random could destroy the vehicle plus any nearby structures, flora or fauna, including humans of course.
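
One simple way to picture safer exploration is an epsilon-greedy policy that only ever samples from a whitelist of actions. This is a minimal sketch under our own assumptions (the whitelist, the action names and the function are all hypothetical); it is not one of the specific methods proposed in the paper.

```python
import random


def choose_action(q_values, safe_actions, epsilon=0.1, rng=random):
    """Epsilon-greedy choice restricted to actions flagged as safe.

    q_values: dict mapping action -> current value estimate
    safe_actions: set of actions the agent is allowed to try
    """
    allowed = [action for action in q_values if action in safe_actions]
    if not allowed:
        raise ValueError("no safe action available")
    if rng.random() < epsilon:
        return rng.choice(allowed)  # explore, but only among safe actions
    return max(allowed, key=q_values.get)  # exploit the best known safe action


# Hypothetical helicopter controller: "dive" looks attractive but is never tried.
estimates = {"hover": 1.0, "climb": 0.5, "dive": 5.0}
whitelist = {"hover", "climb"}
print(choose_action(estimates, whitelist, epsilon=0.0))
```

The hard part in practice is knowing which actions belong on the whitelist in the first place, which is exactly what makes safe exploration a research problem rather than a one-line filter.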

Robustness to distributional shifts – this is the perennial problem where AI or DNNs are trained on one dataset but over time the real world changes and the data they are now seeing has shifted (in distribution) to something else. This often leads to DNNs not operating properly over time or having many more errors in deployment than they did during training. This is probably the one problem in this list that is undergoing more research to try to rectify than any of the others because it impacts just about every ML/AI solution currently deployed in the world today. This robustness to distributional shifts problem is why many AI DNN systems require periodic retraining.
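
As a rough illustration of how a deployed system might notice such a shift, the sketch below compares the mean of a live feature against its training distribution. The function names and the threshold of 1.0 standard deviations are arbitrary assumptions; real monitoring typically uses richer statistical tests over many features.

```python
from statistics import mean, stdev


def shift_score(training_values, live_values):
    """How many training standard deviations the live mean has moved."""
    spread = stdev(training_values)
    if spread == 0:
        raise ValueError("training data has no variance")
    return abs(mean(live_values) - mean(training_values)) / spread


def needs_retraining(training_values, live_values, threshold=1.0):
    """Flag a feature whose live mean has drifted past the (assumed) threshold."""
    return shift_score(training_values, live_values) > threshold


training = [10, 11, 9, 10, 12, 8]
print(needs_retraining(training, [10, 9, 11]))   # live data looks like training data
print(needs_retraining(training, [20, 21, 19]))  # live data has drifted upward
```

A check like this only detects that the inputs have moved; deciding when to retrain, and on what data, is still a judgment call.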

So now we know what to look for, now what

Each of these deserves probably a whole book or more to understand and try to address. The paper talks about all of these and points to some of the research or current directions trying to address them.

The researchers correctly point out that some of the above problems are more pressing when more complex ML/AI agents have more autonomous control over actions in the real world.

We don’t want our automotive automation driving us over a cliff just to see if it’s a better action than staying in the lane. But Go playing bots or article summarizers might be ok to be wrong occasionally if it could lead to better playing bots/more concise article summaries over time. And although exploration is mostly a problem during training, it’s not to say that such activities might not also occur during deployment to probe for distributional shifts or other issues.

However, as we start to see more complex ML/AI solutions controlling more activities, the issue of AI safety is starting to become more pressing. Autonomous cars are just one pressing example. But recent introductions of sorting robots, agricultural bots, manufacturing bots, nursing bots, guard bots, soldier bots, etc. are all just steps down a (short) path of increasing complexity that can only end in some AGI bots running more parts (or all) of the world.

So safety will become a major factor soon, if it’s not already.

Scares me the most

The first two on the list above scare me the most. Avoiding negative or unintentional side effects and reward hacking.

I suppose if we could master scalable oversight we could maybe deal with all of them better as well. But that’s defense. I’m all about offense and tackling the problem up front rather than trying to deal with it after it’s broken.

Negative side effects

Negative side effects is a rather nice way of stating the problem of having your ML destroy the world (or parts of it) that we need to live.

One approach to dealing with this problem is to define or train another AI/ML agent to measure impacts on the environment and have it somehow penalize the original AI/ML for causing them. This learning approach has some potential to be applied to numerous ML activities if it can be shown to be safe and fairly all encompassing.
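
In its simplest form, the penalty idea amounts to subtracting a weighted impact term from the task reward. The sketch below assumes some separate (hypothetical) impact model has already produced a non-negative impact number; the weight and the example values are made up.

```python
def shaped_reward(task_reward, impact, penalty_weight=1.0):
    """Task reward minus a penalty proportional to measured environmental impact."""
    if impact < 0:
        raise ValueError("impact must be non-negative")
    return task_reward - penalty_weight * impact


# Hypothetical toothpick maker: the reckless plan makes more toothpicks
# but does far more damage, so it ends up with the lower shaped reward.
careful = shaped_reward(task_reward=10, impact=1, penalty_weight=2)
reckless = shaped_reward(task_reward=15, impact=9, penalty_weight=2)
print(careful, reckless)
```

Everything interesting is hidden inside the impact number, of course; learning a trustworthy impact measure is the research problem.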

Another approach discussed in the paper is to inhibit or penalize the original ML actions for any actions which have negative consequences. One approach to this is to come up with an “empowerment measure” for the original AI/ML solution. The idea would be to reduce, minimize or govern the original ML’s action set (or potential consequences) or possible empowerment measure so as to minimize its ability to create negative side effects.

The paper discusses other approaches to the problem of negative side effects, one of which is having multiple ML (or ML and human) agents working on the problem it’s trying to solve together and having the ability to influence (kill switch) each other when they discover something’s awry. And the other approach they mention is to reduce the certainty of the reward signal used to train the ML solution. This would work by having some function that would reduce the reward if there are random side effects, which would tend to have the ML solution learn to avoid these.

Neither of these latter two seems as feasible as the others, but they are all worthy of research.

Reward hacking

This seems less of a problem to our world than negative side effects until you consider that if an ML agent is able to manipulate its reward code, it’s probably able to manipulate any code intending to limit potential impacts, penalize it for being more empowered or manipulate a human (or other agent) with its hand over the kill switch (or just turn off the kill switch).

So this problem could easily lead to a break out of any of the other problems present on the list of safety problems above and below. An example of reward hacking is a game playing bot that detects a situation that leads to buffer overflow and results in win signal or higher rewards. Such a bot will no doubt learn how to cause more buffer overflows so it can maximize its reward rather than learn to play the game better.

But the real problem is that a reward signal used to train an ML solution is just an approximation of what’s intended. Chess programs in the past were trained by masters to use their opening to open up the center of the board and use their middle and end game to achieve strategic advantages. But later chess and Go playing bots just learned to win (checkmate their opponent, in the case of chess) and let the rest of the game take care of itself.

Moreover, (board) game play is a relatively simple domain in which to come up with proper reward signals (with the possible exception of buffer overflows or other bugs). But for car driving bots, drone bots, guard bots, etc., reward signals are not nearly as easy to define or implement.

One approach to avoid reward hacking is to make the reward signaling process its own ML/AI agent that is (suitably) stronger than the ML/AI agent learning the task. Most reward generators are relatively simple code. For instance, in Monopoly, one that just counts the money that each player has at the end of the game could be used to determine the winner (in a timed Monopoly game). But rather than having a simple piece of code create the reward signal, one could use ML to learn what the reward should be. Such an agent might be trained to check whether more or less money was being counted than was physically possible in the game, whether property was illegally obtained during the game, or whether other reward hacks were done, and to penalize the ML solution for these actions. These would all make the reward signal depend on proper training of that ML solution. And the two ML solutions would effectively compete against one another.
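
To make the Monopoly example concrete, here is a hand-written stand-in for the kind of check a learned reward agent might perform. It is plain rule-based code, not ML, and the bank total is an assumed figure that would need adjusting for the edition being played.

```python
BANK_TOTAL = 20_580  # assumed total cash in a standard Monopoly set


def checked_reward(cash_by_player, learner, bank_total=BANK_TOTAL, penalty=-1.0):
    """Reward for `learner` at the end of a timed game, with a sanity check."""
    total = sum(cash_by_player.values())
    if total > bank_total or min(cash_by_player.values()) < 0:
        # More money than exists in the game (or negative cash): treat as a hack.
        return penalty
    richest = max(cash_by_player.values())
    return 1.0 if cash_by_player[learner] == richest else 0.0


print(checked_reward({"bot": 1500, "human": 900}, "bot"))     # legitimate win
print(checked_reward({"bot": 999_999, "human": 900}, "bot"))  # impossible total
```

A learned checker would aim to catch hacks nobody thought to write a rule for, which is where the idea gets both its appeal and its difficulty.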

Another approach is to “sandbox” the reward code/solution so that it is outside of external and/or ML/AI influence. Possibly combining the prior approach with this one might suffice.

Yet another approach is to examine the ML solution’s future states (actions) to determine if any of them impact the reward function itself and penalize it for doing this. This assumes that the future states are representative of what it plans to do and that some code or some person can recognize states that are inappropriate.

Another approach discussed in the paper is to have multiple reward signals. These could use multiple formulas for computing the multi-faceted reward signal and averaging them or using some other mathematical function to combine them into something that might be more accurate than one reward function alone. This way any ML solution reward hacking would need to hack multiple reward functions (or perhaps the function that combines them) in order to succeed.
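
A minimal sketch of combining several reward signals is shown below. The individual reward functions, the state fields and the choice of combiners (average or minimum) are illustrative assumptions, not taken from the paper.

```python
from statistics import mean


def combined_reward(state, reward_functions, combine=mean):
    """Evaluate every reward function on state and combine the results."""
    return combine([reward_fn(state) for reward_fn in reward_functions])


# Hypothetical reward signals for a game-playing bot.
def by_score(state):
    return state["score"] / 1000


def by_legality(state):
    return 0.0 if state["glitch_used"] else 1.0


def by_efficiency(state):
    return 1 / state["moves"]


hacked = {"score": 1000, "moves": 10, "glitch_used": True}
signals = [by_score, by_legality, by_efficiency]
print(combined_reward(hacked, signals))               # averaging softens the hack
print(combined_reward(hacked, signals, combine=min))  # taking the minimum vetoes it
```

Which combiner to use is a design choice: averaging is more forgiving of one noisy signal, while the minimum lets any single signal veto a suspicious result.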

The one IMHO that has the most potential but which seems the hardest to implement is to somehow create “variable indifference” in the ML/AI solution. This means having the ML/AI solution ignore any steps that impact the reward function itself or other steps that lead to reward hacking. The researchers rightfully state that if this were possible then many of the AI safety concerns could be dealt with.

There are many other approaches discussed and I would suggest reading the paper to learn more. None of the others seem simple or a complete solution to all potential reward hacks.

~~~

The paper goes into the same or greater level of detail on the other three “concrete safety” issues in AI.

In my last post (see part 5 link above) I thought I was going to write about the discussion of AI safety in the book Human Compatible (AI) by S. Russell. But then I found the “Concrete problems in AI safety” paper (see link above), thought it provided a better summary of AI safety issues, and used it instead. I’ll try to circle back to the book at some later date.

Is AGI just a question of scale now – AGI part-5

Read two articles over the past month or so. The more recent one was an Economist article (AI enters the industrial age, paywall) and the other was A generalist agent (from Deepmind). The Deepmind article was all about the training of Gato, a new transformer deep learning model trained to perform well on 600 separate task arenas from image captioning, to Atari games, to robotic pick and place tasks.

And then there was this one tweet from Nando de Freitas, research director at Deepmind:

Someone’s opinion article. My opinion: It’s all about scale now! The Game is Over! It’s about making these models bigger, safer, compute efficient, faster at sampling, smarter memory, more modalities, INNOVATIVE DATA, on/offline, … 1/N

I take this to mean that AGI is just a matter of more scale. Deepmind and others see the way to attain AGI as just a matter of throwing more servers, GPUs and data at training the model.

We have discussed AGI in the past (see part-0 [ish], part-1 [ish], part-2 [ish], part-3ish and part-4 blog posts [We apologize, only started numbering them at 3ish]). But this tweet is possibly the first time we have someone in the know, saying they see a way to attain AGI.

Transformer models

It’s instructive, from my perspective, that Gato is a deep learning transformer model. The other big NLP models have all been transformer models as well.

Gato (from Deepmind), SWITCH Transformer (from Google), GPT-3 (from OpenAI), GPT-J (from EleutherAI), OPT (from Meta), and Wu Dao 2.0 (from the Beijing Academy of Artificial Intelligence) are all trained on more and more text and image data scraped from the web, Wikipedia and other databases.

Wikipedia says transformer models are an outgrowth of RNN and LSTM models that use attention vectors on text. Attention vectors encode, into a vector (matrix), all textual symbols (words) prior to the latest textual symbol. Each new symbol encountered creates another vector with all prior symbols plus the latest word. These vectors would then be used to train RNN models using all vectors to generate output.

The problem with RNN and LSTM models is that they are very hard to parallelize. Because each step depends on the one before it, you need to work through the symbols in a text component (sentence, paragraph, document) in sequence.

Instead of encoding these attention vectors as it encounters each symbol, a transformer model encodes all symbols at the same time, in parallel, and then feeds these vectors into a DNN to assign attention weights to each symbol vector. This allows for complete parallelism, which also reduces the computational load and the elapsed time to train transformer models.
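For readers who want to see what “all symbols at the same time” looks like, here is a small sketch of textbook scaled dot-product self-attention in plain Python. It is a simplification under my own assumptions: the symbol vectors are used directly as queries, keys and values (real transformers apply learned projections first), and it is not the code of Gato or any other model named here.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(vectors):
    """Return one attention-weighted vector per input symbol vector."""
    d = len(vectors[0])
    outputs = []
    # Each pass of this loop only reads `vectors`, never earlier outputs,
    # so all positions could be computed at the same time.
    for query in vectors:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
            for key in vectors
        ]
        weights = softmax(scores)
        outputs.append(
            [sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(d)]
        )
    return outputs
```

Contrast this with an RNN, where the hidden state for symbol t cannot be computed until symbol t-1 has been processed.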

And transformer models allowed for a large increase in DNN parameters (I read these as the learned weights connecting DNN nodes, which grow with nodes per layer and number of layers in a model). Gato has 1.2B parameters, GPT-3 has 175B parameters, and SWITCH Transformer is reported to have 7X more parameters than GPT-3.

Estimates for how much it cost to train GPT-3 range anywhere from $10M-20M USD.

AGI will be here in 10 to 20 yrs at this rate

So it takes ~$15M to train a 175B transformer model, and Google has already done SWITCH, which has 7-10X (~1.5T) the number of GPT-3 parameters. It seems to be an arms race.

If we assume it costs ~$65M (~2X efficiency gain since GPT-3 training) to train SWITCH, we can create some bounds as to how much it will cost to train an AGI model.

By the way, the number of synapses in the human brain is approximately 1000T (See Basic NN of the brain, …). If we assume that DNN nodes are equivalent to human synapses (a BIG IF), we probably need to get to over 1000T parameter model before we reach true AGI.

So my guess is that any AGI model lies somewhere between 650X to 6,500X parameters beyond SWITCH or between roughly 1Q to 10Q model parameters.

If we assume current technology to do the training, this would cost $40B to $400B. Of course, GPUs are not standing still, and NVIDIA’s Hopper (introduced in 2022) is at least 2.5X faster than their previous gen A100 GPU (introduced in 2020). So if we waited 10 years or so, we might be able to reduce this cost by a factor of 100X, and in 20 years maybe by 10,000X, or back to roughly where SWITCH is today.
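The back-of-envelope numbers above can be reproduced with a few lines of Python. The sketch assumes, as the text does, that training cost scales linearly with parameter count, and it further assumes a new GPU generation every 2 years with each one 2.5X faster and that speedup translating directly into lower cost. The constant and function names are mine.

```python
SWITCH_COST_USD = 65e6        # assumed SWITCH training cost from the text
SCALE_LOW, SCALE_HIGH = 650, 6_500
GPU_SPEEDUP_PER_GEN = 2.5     # Hopper vs. A100, per the text
YEARS_PER_GEN = 2             # assumed cadence (2020 -> 2022)


def training_cost(scale, years_from_now=0):
    """Estimated cost (USD) to train a model `scale` times the size of SWITCH."""
    generations = years_from_now / YEARS_PER_GEN
    return SWITCH_COST_USD * scale / GPU_SPEEDUP_PER_GEN ** generations


for years in (0, 10, 20):
    low = training_cost(SCALE_LOW, years)
    high = training_cost(SCALE_HIGH, years)
    print(f"in {years:2d} years: ${low:,.0f} to ${high:,.0f}")
```

Five GPU generations at 2.5X each is a bit under 100X, and ten generations is a bit under 10,000X, which is where the 10 and 20 year reductions come from.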

So in the next 20 years most large tech firms should be able to create their own AGI models. In the next 10 years most governments should be able to train their own AGI models. And as of today, a select few world powers could train one, if they wanted to.

Where they get the additional data to train these models (I assume that data counts would go up linearly with parameter counts) may be another concern. However, I’m sure if you’re willing to spend $40B on AGI model training, spending a few $B more on data acquisition shouldn’t be a problem.

~~~~

At the end of the Deepmind article on Gato, it talks about the need for AGI safety in terms of developing preference learning, uncertainty modeling and value alignment. The footnote for this idea is the book, Human Compatible (AI) by S. Russell.

Preference learning is a mechanism for AGI to learn the “true” preference of a task it’s been given. For instance, if given the task to create toothpicks, it should realize the true preference is to not destroy the world in the process of making toothpicks.

Uncertainty modeling seems to be about having AI assume it doesn’t really understand what the task at hand truly is. This way there’s some sort of (AGI) humility when it comes to any task, such that the AGI model would be willing to be turned off if it’s doing something wrong, with that decision being made by humans.

Deepmind has an earlier paper on value alignment. But I see this as the ability of AGI to model human universal values (if such a thing exists) such as the sanctity of human life, the need for the sustainability of the planet’s ecosystem, all humans are created equal, all humans have the right to life, liberty and the pursuit of happiness, etc.

I can see a future post is needed soon on Human Compatible (AI).

Better autonomous drone flying with Neural-Fly

Read an article the other day on Neural-Fly (see: Rapid adaptation of deep learning teaches drones to survive any weather) based on research out of CalTech documented in a paper in Science Robotics (see: Neural-Fly enables rapid learning for agile flight in strong winds, behind paywall).

Essentially they have trained two neural networks (NN) at the same time and computed an adaptation coefficient matrix (with linear multipliers to compensate for wind speed). The first NN is trained to understand the wind invariant flight characteristics of a drone in wind and the second is trained to predict the class of wind the drone is flying in (or wind index). These two, plus the adaptation control matrix coefficients, are used to predict the resultant force on the drone in wind.

In a CalTech article on the research (see: Rapid Adaptation of Deep Learning Teaches Drones to Survive Any Weather) at the bottom is a YouTube video that shows how well the drone can fly in various wind conditions (up to 27mph).

The data to train the two NNs and compute the adaptation matrix coefficients come from CalTech wind tunnel results with their custom built drone (essentially an RPi4 added to a pretty standard drone) doing random trajectories under different static wind conditions.

The two NNs and the adaptation control matrix functionality run on a Raspberry Pi 4 (RPi4) that’s added to a drone they custom built as the test vehicle. The 2 NNs and the adaptation control tracking are used in the P-I-D (proportional-integral-derivative) controller for drone path prediction. The Neural-Fly 2 NNs plus the adaptation functionality effectively replace the residual force prediction portion of the Integral section of the P-I-D controller.

The wind invariant neural net has 5 layers with relatively few parameters per layer. The wind class prediction neural network has 3 layers and even fewer parameters. Together these two NNs plus the adaptation coefficients provide real time resultant force predictions on the drone, which can be fed into the drone controller to control drone flight. All of this is accomplished, in real time, on an RPi4.

The adaptation factor matrix is also learned during training of the 2 NNs. And this is what’s used in the NF-Constant results below. But the key is that the linear factors (adaptation matrix) are updated (periodically) during actual drone flight by sampling the measured actual force and predicted force on the drone. The adaptation matrix coefficients are updated using a least squares estimation fit.
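To give a feel for what a least squares fit of the linear factors involves, here is a toy sketch. It assumes the predicted force is a dot product between a 2-element feature vector (standing in for the wind invariant NN output) and two adaptation coefficients, and it refits those coefficients from a batch of in-flight samples by solving the normal equations. The 2-D shape, the names and the plain batch fit are my assumptions; the paper’s actual update law may well differ.

```python
def fit_adaptation(features, forces):
    """Least squares fit of (a1, a2) so that phi[0]*a1 + phi[1]*a2 ~= force.

    features: list of 2-element tuples (hypothetical NN basis outputs)
    forces:   list of measured forces, one per feature sample
    """
    s11 = sum(p[0] * p[0] for p in features)
    s12 = sum(p[0] * p[1] for p in features)
    s22 = sum(p[1] * p[1] for p in features)
    b1 = sum(p[0] * f for p, f in zip(features, forces))
    b2 = sum(p[1] * f for p, f in zip(features, forces))
    det = s11 * s22 - s12 * s12
    if abs(det) < 1e-12:
        raise ValueError("feature samples are degenerate; cannot fit coefficients")
    # Closed-form inverse of the 2x2 normal-equation matrix.
    a1 = (s22 * b1 - s12 * b2) / det
    a2 = (s11 * b2 - s12 * b1) / det
    return a1, a2


def predict_force(phi, coeffs):
    """Predicted residual force for one feature sample."""
    return phi[0] * coeffs[0] + phi[1] * coeffs[1]
```

The point of refitting in flight is that the NN features stay fixed while only this handful of linear factors changes, which is cheap enough to do on something like an RPi4.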

In the report’s supplemental information, the team showed a couple of state of the art adaptation approaches to the problem of drone flight in wind. In the above chart, the upper section is the x-axis wind effect and the lower portion is the z-axis wind effect; f (grey) is the actual force in that direction and f-hat (red) is the predicted force. The first column represents results from a normal integral controller. The next two columns are state of the art approaches (INDI and L1, see paper references) to the force prediction using adaptive control. If you look closely at these two columns and compare the force prediction (in red) and the actual force (in grey), the force prediction always lags behind the actual force.

The next three columns show Neural-Fly constant (Neural-Fly with a constant adaptive control matrix, not being updated during flight), Neural-Fly-transfer (using the NN trained on one drone and applying its adaptation to another drone in flight) and Neural-Fly. Neural-Fly constant also shows a lag between the predicted force and the actual force. But Neural-Fly-transfer and Neural-Fly reduce this lag considerably.

The measurement for drone flight accuracy is tracking positional error. That is the difference between the drone’s desired position and its actual position over a number of trajectories. As shown in the chart, tracking error decreased from 5.6cm to ~4cm at a wind speed of 4.2m/s (15.1km/h or 9.3mph). Tracking error increases for wind speeds that were not used in training and for NF-transfer, but at all wind speeds the tracking error is better with Neural-Fly than with any other approach.
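As a simple illustration of the metric, the sketch below computes a mean tracking error from paired desired and actual positions. Using the mean Euclidean distance is my assumption about how the error is aggregated; the paper may use a different statistic.

```python
import math


def mean_tracking_error(desired, actual):
    """Mean Euclidean distance between desired and actual positions (same units)."""
    if len(desired) != len(actual) or not desired:
        raise ValueError("need two equal-length, non-empty position lists")
    return sum(math.dist(d, a) for d, a in zip(desired, actual)) / len(desired)


# Hypothetical samples in metres: (x, y, z) per time step.
desired = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
actual = [(0.0, 0.0, 1.03), (1.0, 0.04, 1.0)]
error_m = mean_tracking_error(desired, actual)
```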

Pretty impressive results from just using an RPi4.

[The Eds. would like to thank the CalTech team and especially Mike O’Connell for kindly answering our many questions on Neural-Fly.]

Go big or go home for robust DNNs

Read a recent article, Computer Scientists Prove why Bigger NNs do better, discussing scientific research that proved a Universal Law of Robustness via Isoperimetry. This speaks to the perturbability of AI deep learning neural networks (DNN) and how to reduce it. But it also applies to many other solutions to diverse multi-dimensional data problems.

Mathematical Robustness

For AI ML DNNs, we often witness supposedly well trained DNN models that do very well at classifying examples of data similar to their training data but fail miserably on data that’s outside their training data.

Mathematicians call this attribute robustness and can measure it on a mapping function using a Lipschitz constant. One can consider this a measure of how much the output of a mapping can vary for a given change in its input. In the case of DNNs, a lack of robustness means classifications fail on relatively minor changes to input data.
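A Lipschitz constant bounds how fast a function’s output can change relative to its input: |f(x) - f(y)| <= L * |x - y| for all x and y. The sketch below estimates a lower bound on L empirically by checking every pair in a sample of points. The function name and the sampling approach are mine, and a finite sample can only ever underestimate the true constant.

```python
import math
from itertools import combinations


def lipschitz_lower_bound(f, points):
    """Largest observed ratio |f(x) - f(y)| / |x - y| over all sampled pairs."""
    best = 0.0
    for x, y in combinations(points, 2):
        dx = math.dist(x, y)
        if dx > 0:
            best = max(best, math.dist(f(x), f(y)) / dx)
    return best


# A smooth mapping versus one that jumps on a tiny change in input.
def smooth(x):
    return (3.0 * x[0],)


def jumpy(x):
    return (0.0,) if x[0] < 0.5 else (1.0,)


samples = [(0.0,), (0.49,), (0.51,), (1.0,)]
smooth_l = lipschitz_lower_bound(smooth, samples)
jumpy_l = lipschitz_lower_bound(jumpy, samples)
```

The jumpy mapping behaves like a brittle classifier: a small nudge to the input flips the output, so its ratio is far larger than the smooth one’s.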

Most serious AI researchers have empirically discovered that bigger DNNs work better and are more robust than smaller networks. There’s been somewhat of a conundrum as to why DNNs need to get bigger to properly generalize.

Universal Law of Robustness

What the researchers have proved is that in order to achieve some arbitrary level of robustness for a mapping function like a DNN, one needs many more parameters than the number of training data elements would indicate.

For example, with the MNIST handwritten digit classification problem, models with 10**5 parameters to 10**6 parameters are required to achieve a 90% and 95% accuracy, respectively. But MNIST training data is 60K examples (10**4). Why should an MNIST DNN classification model need more than 10**4 parameters to achieve 100% accuracy?

Author’s MNIST model with 688K parameters

From what we all learned in high school maths, to solve a function with N variables one needs N equations. This would lead one to believe that MNIST DNNs (essentially solving classification equations) should only need 60K or 10**4 parameters. But real DNNs to solve MNIST need more than that.

Looking at it in 2D. If one has two points, (x,y) for point A that maps to another (x,y) point B, one should only need to know one of the points and the slope of the line that connects them, or two parameters: point A (or B) and line slope.

Now with MNIST data that maps handwritten digits to one of 10 digits, we have essentially 10 possibilities being mapped from 60K samples. At best, we should need to know the 60K initial points in this image data space and their slope to the 10 digits they represent. Again something that approaches 60K pairs of parameters: one for the image point and one for the slope. But why doesn’t an MNIST model with 60K parameters achieve 100% accuracy?

I won’t claim to understand the math but what the researchers seem to be saying is that in order to have a relatively smooth mapping from the image space to the digit space one has to have 10**4 parameters X the dimensionality of the data. In this case, for MNIST, the dimensionality of the data is related to image size of 28X28, 0..255 grey scale pixel images. The image space alone would be on the order of 10**5. So multiplying this by the size of the training data, the researchers estimate that the number of parameters should be 10**9 to be 100% accurate.

Although, the researchers say that the data dimensionality of the MNIST images is probably not 10**5 (how they concluded this is not evident). As such, they believe one shouldn’t need 10**9 parameters to reach 100% proper classifications. They say it’s probably 1 or 2 orders of magnitude less, because not all of the image data space is populated. So if we use 10**3 as an estimate of the effective data dimensionality, they would estimate that one would need a 10**7 parameter DNN to reach 100% accuracy on MNIST data.
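The arithmetic in the last two paragraphs boils down to multiplying the number of training samples by the (effective) data dimensionality. A tiny sketch, using the order-of-magnitude numbers from this post rather than exact figures from the paper, with a function name of my own choosing:

```python
def robust_param_estimate(n_samples, data_dim):
    """Rule of thumb from the text: parameters ~ samples x data dimensionality."""
    return n_samples * data_dim


N_MNIST = 10**4  # order of magnitude used in the text for 60K training images

full_space = robust_param_estimate(N_MNIST, 10**5)     # raw image space
effective = robust_param_estimate(N_MNIST, 10**3)      # assumed effective dimension

print(f"raw image space:     {full_space:.0e} parameters")
print(f"effective dimension: {effective:.0e} parameters")
```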

The author’s MNIST model achieved a 99.2% accuracy after training for 15 Epochs, batch size=5. Although 688K parameters is not quite 10**6 parameters, it’s close. Unclear why one would need another factor of 10, but getting that extra 0.8% accuracy (to 100%) can be very difficult to achieve for any DNN model.

Another example, OpenAI’s GPT-3 NLP model

And OpenAI’s GPT-3 NLP model has 175B parameters. Their previous version, GPT-2, only had 1.5B parameters and they say that GPT-4 will have over 100T parameters. The chart above shows accuracy stats for 3 versions of the GPT-3 model, one with 175B, one with 13B and another with 1.3B parameters.

According to OpenAI’s GPT-3 description, it can complete “almost any English language task” (text in ==> text out). This includes writing articles from a few prompts and text summarization.

GPT-3 was trained on almost 500B tokens (from web crawls to Wikipedia dumps). Each token probably represents an English word or word phrase. According to the universal law, 175B parameters would not be sufficient. That’s probably why GPT-3 in the above chart didn’t reach 70% accuracy.

Probably would need at least another 3 orders of magnitude to get there or 175T parameters. Maybe with GPT-4, I can have it start writing my blog posts.

I don’t know about you, but I’m going to need more GPUs for my (home) AI lab.

Deepmind does code – part 1: the data

1st, let me express my and my fellow coders’/programmers’ disappointment that Deepmind would take on coding. There are many other white collar work domains that need to be conquered before coding.

2nd, let me apologize for the lack of blog posts lately, all I can say is, business is picking up.

Saw an article over the last couple of weeks on Deepmind creating AlphaCode, an artificial intelligence coding application, which they used to enter coding contests and achieved an average 1238 rating, or better than 54% of code contest participants.

I can’t recall where I first saw the news but Deepmind has a pretty decent blog post on AlphaCode and they have published a pre-print of their research paper on AlphaCode as well. I plan on discussing AlphaCode in detail over a couple of posts. This will be the first installment, on where they got the data to train their models.

AlphaCode is a transformer-based language model (see: Wikipedia: Transformer (machine learning model) article) that translates a code competition problem statement into code, or a program that, when executed, can solve the problem statement. In order to train AlphaCode, Deepmind first needed to obtain lots of source code.

It’s all about the (training) data

The first step in Deep Learning model generation is gathering data to train the model. Now where would Google’s Deepmind go to gather coding data – well GitHub, a public repository of all things software, of course.

They used GitHub data to pre-train their model(s) but also scraped code (problem statements & test cases) from published code contests to fine-tune their model.

Deepmind has released their fine-tuning data, the CodeContests training data for AlphaCode, on GitHub, so as to support other organizations in creating AI models for coding.

GitHub source to the (pre-training) rescue

There are a couple of problems with using GitHub source code for training:

  • GitHub code is in any source code language the author feels most appropriate to use.
  • GitHub code is not guaranteed to work correctly.
  • GitHub code is not guaranteed to be completed code.
  • GitHub code represents a wide range of coding skill.
  • GitHub code doesn’t always come with a problem statement.

But the use of GitHub in their pre-training data set is intended to give their transformer-based language model some capability to understand (learn) what coding is all about, what a proper syntax would be, what a proper coding sequence would be, etc.

The AlphaCode team took a snapshot of selected GitHub source repos. This meant they only scraped repos that contained C++, C#, Go, Java, JavaScript, Lua, PHP, Python, Ruby, Rust, Scala, and TypeScript languages. They also dropped from pre-training data any source code files larger than 1MB or that had any lines longer than 1000 characters. This was done to avoid using any machine generated code. They also stripped all the white space out of the selected source code files and compared them to eliminate all duplicated code.
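The filtering steps described above can be sketched in a few lines of Python. This is my own illustration of the idea, not Deepmind’s pipeline: the file extensions, the exact byte cutoff, the use of SHA-256 for comparing whitespace-stripped files and all the function names are assumptions.

```python
import hashlib
import os

MAX_FILE_BYTES = 1_000_000   # the "1MB" cutoff from the text (exact value assumed)
MAX_LINE_CHARS = 1000
# Assumed extensions for the twelve languages listed in the text.
ALLOWED_EXTENSIONS = {".cpp", ".cs", ".go", ".java", ".js", ".lua",
                      ".php", ".py", ".rb", ".rs", ".scala", ".ts"}


def looks_hand_written(path, text):
    """Apply the language, file-size and line-length filters."""
    if os.path.splitext(path)[1] not in ALLOWED_EXTENSIONS:
        return False
    if len(text.encode("utf-8")) > MAX_FILE_BYTES:
        return False
    return all(len(line) <= MAX_LINE_CHARS for line in text.splitlines())


def whitespace_free_digest(text):
    """Hash of the file with all whitespace removed, used to spot duplicates."""
    return hashlib.sha256("".join(text.split()).encode("utf-8")).hexdigest()


def build_pretraining_set(files):
    """files: iterable of (path, text). Returns the filtered, de-duplicated list."""
    seen, kept = set(), []
    for path, text in files:
        if not looks_hand_written(path, text):
            continue
        digest = whitespace_free_digest(text)
        if digest not in seen:
            seen.add(digest)
            kept.append((path, text))
    return kept
```

Stripping whitespace before comparing means two copies of the same file that differ only in indentation or line endings are treated as duplicates.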

Their final pre-training dataset was 715GB of data over 86 million source files.

Although unstated, we would guess that the AlphaCode team used the GitHub repo’s README.md file as a surrogate for the solution description. Unclear what else could have been used unless they generated it automatically by extracting semantic content or generating a summarization of the README.md files.

Excerpt from Deepmind’s competitive code contest source code & problem statements README.md file

The (pre-)training data can be used to train a transformer-based language model. These are used today to provide language translation. In AlphaCode’s case, they wanted to create a transformer-based code model that translates a specification of a coding problem into source code to solve that problem.

Language translation models use text files that are in different languages but represent the same law or information and, notably, are human generated translations.

One challenge with using internet scraped data for training is that it can easily contain actual solutions, verbatim, for the problems the model is trying to solve. In order to avoid copying these solutions entirely, they decided to split their data into a training set, validation set and test set on a time basis. This way the training data used source code/problem statements only from a period of time prior to the validation set. Likewise, the training and validation data came from a period of time prior to the test data.

To show that this approach (using a time point to split the data) worked, they trained a 1B parameter AlphaCode transformer on two different training-validation datasets: one where the validation data was selected at random (the normal approach to selecting validation data), the “random” split, and the other where validation data only occurred some time after the training data, the “temporal” split. The 1B AlphaCode transformer was able to properly code 0.8% of the problems using a 13K sample of 86M source files/problem statements on the random split, but 0% on the temporal split.
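The difference between the two splits is easy to see in code. Below is a minimal sketch, assuming each problem is a dict with a hypothetical "published" date field; the cutoff dates, field name and function names are mine and not taken from the paper.

```python
import random
from datetime import date


def temporal_split(problems, train_end, valid_end):
    """Train on everything before train_end, validate up to valid_end, test after."""
    train = [p for p in problems if p["published"] < train_end]
    valid = [p for p in problems if train_end <= p["published"] < valid_end]
    test = [p for p in problems if p["published"] >= valid_end]
    return train, valid, test


def random_split(problems, valid_fraction=0.1, seed=0):
    """The usual approach: shuffle, then carve off a validation slice."""
    shuffled = list(problems)
    random.Random(seed).shuffle(shuffled)
    n_valid = int(len(shuffled) * valid_fraction)
    return shuffled[n_valid:], shuffled[:n_valid]


problems = [
    {"id": 1, "published": date(2020, 1, 1)},
    {"id": 2, "published": date(2021, 6, 1)},
    {"id": 3, "published": date(2021, 10, 1)},
]
train, valid, test = temporal_split(problems, date(2021, 1, 1), date(2021, 9, 1))
```

With a random split, a validation problem may have been published (and its solutions posted online) before some of the training data was collected. With the temporal split that cannot happen, so a solve on the validation set is less likely to be a memorized copy.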

So much for pre-training, let’s discuss fine tuning

AlphaCode was going to get nowhere with a 0% solve rate (ok this was based on a 13K sample and only a 1B parameter model) but they realized that Git code was only going to get them so far. (ok conjecture on my part)

So fine-tuning beyond pre-training (Git derived) data was needed. So the AlphaCode team turned to code competition source code/problem statement data.

Most code contests publish source code submissions as well as the problem statements and sample test cases. By scraping these, Deepmind was able to attain a very well annotated dataset they could use to fine-tune their AlphaCode transformer model.

They again used a temporal split for training/validation/test data. But they were also able to add metadata to their data that indicated whether the code solved the problem statement.

Code competitions also publish tests for the problem statement. Having the tests, a human can use them to validate whether their code at least works against the tests. Code contests also have a set of more (sophisticated) hidden tests that they use internally to validate code submissions.

This test data will become important later on in the model’s operation, which will be discussed in a future post, but suffice it to say that AlphaCode uses the public tests (and mutations of these) to validate AlphaCode generated source code before submitting it.
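As a preview of the idea, here is a rough sketch of filtering candidate solutions against public tests. It treats each candidate as a plain Python callable and each test as an (input, expected output) pair, which is a big simplification on my part; a real system would run generated programs in a sandbox with time and memory limits. All names are hypothetical.

```python
def passes_public_tests(candidate, tests):
    """True only if the candidate returns the expected output for every test."""
    for given, expected in tests:
        try:
            if candidate(given) != expected:
                return False
        except Exception:
            # A crashing candidate counts as a failed test.
            return False
    return True


def filter_candidates(candidates, tests):
    """Keep only the candidates worth submitting."""
    return [c for c in candidates if passes_public_tests(c, tests)]


# Hypothetical problem: return the sum of a list of integers.
public_tests = [([1, 2, 3], 6), ([], 0), ([-1, 1], 0)]


def good(xs):
    return sum(xs)


def wrong(xs):
    return len(xs)


def crashes(xs):
    return xs[0] + sum(xs[1:])  # fails on the empty list


survivors = filter_candidates([good, wrong, crashes], public_tests)
```

Passing the public tests doesn’t guarantee passing the contest’s hidden tests, but it cheaply weeds out candidates that have no chance.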

This fine-tuning dataset is available in the GitHub repo (linked to above) that Deepmind has created/curated for others to work with.

Another nicety of this fine-tuning data is they have proper, human created, problem statements to work from rather than README.md surrogates.

In part-2 we plan to describe the transformer-based model that was created for AlphaCode and at some point, discuss how they used testing in their code submissions.

Once again, all my information comes from Deepmind’s pre-print on their AlphaCode project (linked to above).

Any comments, please don’t hesitate to let me know.