For the past half decade or more, each new generation of foundation models has grown significantly larger (10X or more) in parameter count than its predecessor. The presumption is that more parameters will always lead to better models, better inferences, more users, etc. This growth has been driven primarily by compute scaling: more compute thrown at training results in bigger models.
But at some point any process reaches saturation, a point of diminishing returns where throwing more (of anything) at it yields only marginally better results, not commensurate with the additional cost. It's unclear whether we are there yet with foundation models, but my guess is we are reaching that point rapidly.
It's interesting that GPT-5 seems to have about the same number of parameters as GPT-4 (~1.8T).
Not being an active user of foundation models, I can't really tell whether …-5 is much better than …-4, but the consensus seems to be that the models are not improving as much as they used to.

There are probably a number of reasons why this could be the case. The data wall, for one. The power and cooling cost of exponentially increasing AI model size is impacting not just training costs but inferencing costs as well. But the end of the scaling advantage may be another.
Don't get me wrong: if it weren't for compute scaling we wouldn't have the AI we have today. NN training processes were invented in the 1950s, but researchers didn't have the compute power to use them at the time. It wasn't until this century that computation caught up.
As more compute power became available, those old compute-bound techniques proved to be the linchpin for DNN training, and we are still riding that curve today, up to a point.

It's just that speeding up the same old DNN training will lose effectiveness at some point, if not today, then tomorrow.
I've seen it myself in some rudimentary models I have trained. At some point, adding nodes, layers, training epochs, etc. doesn't always result in better models; they often get worse.
AGI
And AGI, I believe, will require us to take a different tack than current foundation model DNN training to get right. Call it a hunch. But one can see glimmers of this in the fact that AGI always seems to be just a few years away.
In order to achieve AGI, for safety reasons, for planetary climate reasons, and because scale is not getting us there anymore, I strongly believe we need to rethink our approach to foundation model training.
I'm no expert, but I think what needs to change is more use of deep reinforcement learning (DRL), not just the reinforcement learning from human feedback (RLHF) used today for fine-tuning foundation models. This would mean using DRL much earlier and more comprehensively, in all phases of foundation model training.
Yes, DRL also consumes compute infrastructure and more “training episodes” for DRL can often lead to better model outcomes, but not always.
DRL training for AGI models
For any reinforcement learning to work, one needs a reward signal that tells the optimizer how to improve the DRL model. So the real challenge in using more DRL for foundation model training is what (or who) supplies that reward signal in response to an action taken by the DRL model.
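To make the role of the reward signal concrete, here's a toy sketch of a REINFORCE-style policy update on a two-action bandit. Everything here (the environment, the payoff probabilities, the learning rate) is a made-up illustration, not anyone's actual training stack; the point is just that the reward signal is the only thing steering the policy.

```python
import math
import random

random.seed(0)

# Hypothetical environment: action 1 pays off more often than action 0.
# In a real setup this signal would come from a game, a simulator,
# or (for AGI training) some kind of world model.
def reward_signal(action):
    return 1.0 if random.random() < (0.8 if action == 1 else 0.2) else 0.0

prefs = [0.0, 0.0]   # policy "preferences" (logits), one per action
alpha = 0.1          # learning rate

def softmax(p):
    exps = [math.exp(x) for x in p]
    s = sum(exps)
    return [e / s for e in exps]

for episode in range(2000):
    probs = softmax(prefs)
    action = 0 if random.random() < probs[0] else 1
    r = reward_signal(action)
    # REINFORCE-style update: nudge the probability of the taken
    # action up in proportion to the reward received.
    for a in range(2):
        grad = (1 - probs[a]) if a == action else -probs[a]
        prefs[a] += alpha * r * grad

print(softmax(prefs))  # the policy should now strongly favor action 1
```

Swap in a different `reward_signal` and the same loop learns a different policy, which is exactly why the question of who supplies that signal matters so much.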
Historically, reward signals for games came from the game environment (or model); for robotic motion they can come from physics simulators or movement in the real world.
But any reward signal for AGI foundation models would need much more sophistication than the above.
The easy answer is to create world simulation models. Something that could simulate how the world (in total) would react to an action (or inference) of the foundation model.
But that's not easy. World simulation models at the fidelity needed to support DRL for AGI foundation models don't exist, and few if any researchers (AFAIK) are working on getting us there.
But some rudimentary baby steps already exist. Physics engines (models of real-world physical processes) have existed for a long time now and would no doubt be the core of any world simulation model. Nature simulation models exist, at least for climate and weather, and these could also be incorporated into any world model.
What's missing would be:
- Geophysical world simulations that would model how the physical world would react to actions taken by an AGI model. I'm aware of many petroleum industry earth simulations, and the same goes for plate tectonics, wind, and water movement, but these would all need to be combined into something that provides entire-world geophysical reactions to model actions,
- Biospherical world simulations that would model (at least at some level) how the biological natural world (animals, plants, fungi, microbes, etc.) would react to actions. Weather models may have some of this, at least with respect to carbon cycles which span human-natural boundaries, but we would need a lot more.
- Psychological world simulations, or something that would simulate how a person, and a population of humans, would react to actions taken by a model. I am unaware of anything available at this level except for a simulation of a baby I saw at SIGGRAPH a couple of years ago. A lot more work would be needed here to get this up to a level that could support AGI training.
- Sociological-political world simulations, or something that would model how human society across the world would react to model actions. Again, some of these exist, at an even more rudimentary level than financial or weather modeling, and a lot of work would be needed to get them to the fidelity needed for AGI training.
- Financial-business world simulations that would determine the financial reactions to model actions. Some of these exist for national economies, but they would need to be broadened to the world at large and to much finer resolution and granularity to be suitable for AGI foundation model training.
I am certainly missing some critical models that may be needed for true world simulations, but these could provide a start. They would need to be combined, of course, in some fashion.
And determining the various reward weights would be non-trivial. It seems to me that each of these simulations could emit multiple reward signals for any action, and combining them all into one signal is another challenge. But those are parameter optimizations, which, once we have world models working in unison, we can tweak at will.
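One simple way to fold the per-simulator signals into the single scalar reward DRL needs is a weighted sum. The component names and weights below are purely illustrative assumptions (no such system exists); in practice these weights would themselves be contentious design choices.

```python
# Hypothetical reward weights for each world-simulation component;
# the names and values are illustrative, not from any real system.
REWARD_WEIGHTS = {
    "geophysical": 0.25,
    "biospherical": 0.25,
    "psychological": 0.20,
    "sociopolitical": 0.20,
    "financial": 0.10,
}

def combined_reward(component_rewards):
    """Fold per-simulator rewards into one scalar DRL reward.

    component_rewards: dict mapping simulator name -> reward in [-1, 1].
    Simulators that return no signal for an action contribute nothing.
    """
    return sum(REWARD_WEIGHTS[name] * r
               for name, r in component_rewards.items()
               if name in REWARD_WEIGHTS)

# Example: an action the financial simulator likes (+0.9) but the
# biosphere simulator penalizes (-0.6) nets out slightly negative.
r = combined_reward({"financial": 0.9, "biospherical": -0.6})
print(r)  # 0.10*0.9 + 0.25*(-0.6) = -0.06
```

A weighted sum is the crudest possible combiner; one could equally imagine vetoes (any strongly negative component zeroes the reward) or learned combination functions, which is part of why tuning these weights is non-trivial.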
Then there's the "action space" for an AGI model. For games and robotic motion, the actions are well defined and finite. For an AGI model, the actions are potentially infinite. Even if we limited them to a single domain such as tokenized text strings, the magnitude of such an action space would be 10K-10M X anything tried before with DRL. But I still believe it's doable.
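Some rough back-of-the-envelope arithmetic makes the scale gap concrete. The numbers below are order-of-magnitude assumptions (classic Atari exposes 18 joystick actions, Go about 362 moves, LLM tokenizers run to ~100K tokens), not measurements of any particular system.

```python
import math

# Illustrative DRL action-space sizes; all figures are rough assumptions.
atari_actions = 18        # classic Atari joystick action set
go_actions = 362          # 19x19 board positions + pass
vocab = 100_000           # typical large LLM tokenizer vocabulary

# Even emitting a single token already dwarfs classic DRL action spaces:
print(vocab / atari_actions)   # thousands of times larger than Atari
print(vocab / go_actions)      # hundreds of times larger than Go

# And a short 20-token response is a combinatorial space of vocab**20 --
# work in log10 to keep the number printable:
print(20 * math.log10(vocab))  # ~100 orders of magnitude of possible actions
```

So the per-step gap is thousands-fold, and over even short token sequences the space explodes combinatorially, which is why some way to structure or factor the action space would be essential before DRL could grapple with it.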
Once we had such a model together, with a decent reward function and some way to categorize or grasp the near-infinite actions an AGI could take, DRL could be used to train an AGI.
Of course, this may take a few billion or trillion actions/training episodes to get something worthwhile out of it.
But maybe after something like that (or 10M X that) we could create a safe and effective AGI.
~~~~
Comments?
Photo Credit(s):
- OCP Summit 2024, AMD Hardware Optimizations for power efficient AI, presentation slide
- Thomas Jefferson National Accelerator Facility (Jefferson Lab), flickr photo
- SigGraph 2024, Beyond the illusion of life, Keynote presentation slide
