Read a recent article in the NY Times, An industry insider drives an open alternative to big tech’s AI, about the Allen Institute for AI releasing a massive corpus of data, Dolma: 3 Trillion Token Open Corpus for Language Model Pre-training, that can be used to train LLMs, available for download from HuggingFace.
The intent of the data release is to ultimately supply an open source alternative to closed source Google/OpenAI LLMs, and a more fully open source LLM than Meta’s Llama 2, that the world’s research community can use to understand, de-risk and further AI and, ultimately, AGI development.
We’ve written about AGI before (see our latest, One agent to rule them all – AGI part 7, which has links to parts 1-6 of our AGI posts). Needless to say, it’s a very interesting topic to me and should be to the rest of humankind. LLMs are a significant step towards AGI, IMHO.
One of the Allen Institute for AI’s (AI2) major goals is to open source an LLM (see Announcing AI2 OLMo, an Open Language Model Made by Scientists for Scientists), including the data (Dolma), the model, its weights, the training tools/code, the evaluation tools/code, and everything else that went into creating their OLMo (Open Language Model) LLM.
This way the world’s research community can see how it was created and perhaps help in ensuring it’s a good (whatever that means) LLM. Releasing Dolma is a first step towards a truly open source LLM.
The Dolma corpus
AI2 has released a report on the contents of Dolma (dolma-datasheet.pdf) which documents much of what went into creating the corpus.
The datasheet goes into a good level of detail on where the corpus data came from, how each data segment is licensed, and other metadata that allows researchers to understand its content.
For example, for the Common Crawl data they have included all of the websites’ URLs as identifiers, and for The Stack data the names of the GitHub repos used are included in the data’s metadata.
In addition, the Dolma corpus is released under an AI2 ImpACT license as a medium risk artifact, which requires disclosure for use (download). Medium risk ImpACT licensing means that you cannot re-distribute (externally) any copy of the corpus, but you may distribute any derivatives of the corpus with “Flow down use restrictions”, “Attribution” and “Notices”.
Which seems to say you can do an awful lot with the corpus and still be within its license restrictions. They do require a Derivative Impact Report to be filed, which is sort of a model card for the corpus derivative you have created.
What’s this got to do with AGI
All that being said, the path to AGI is still uncertain. But the textual abilities of recent LLM releases seem to be getting closer and closer to human skill in creating text, code, interactive agents, etc. Yes, this may be just one “slim” domain of human intelligence, but textual skills, when and if perfected, can be applied to much of what white collar workers do these days, at least online.
A good text LLM would potentially put many of our jobs at risk but could also possibly open up a much more productive, online workforce, able to assimilate massive amounts of information, and supply correct-current-vetted answers to any query.
The elephant in the room
But all that raises the real question behind AI2’s open sourcing OLMo: how do we humans create a safe, effective AGI that can benefit all of humankind rather than any one organization or nation? One that can be used safely by everyone to do whatever is needed to make the world a better society for all.
Versus some artificially intelligent monstrosity that sees humankind, or any segment of it, as an enemy or an obstacle to whatever it believes needs to be done, and eliminates us or, worse, ignores us as irrelevant.
I’m of the opinion that the only way to create a safe and effective AGI for the world is to use an open source approach to create many (competing) AGIs. There are a number of benefits to this as I see it. With a truly open source AGI:
- Any organization (with sufficient training resources) can have access to its own personally trained AGI, which means no one organization or nation can gain the lion’s share of benefits from AGI.
- Would allow the creation and deployment of many competing AGIs, which should help limit and check any one of them from doing us or the world any harm.
- All of the world’s researchers can contribute to making it as safe as possible.
- All of the world’s researchers can contribute to making it as multi-culturally effective and correct as possible.
- Anyone (with sufficient inferencing resources) can use it for their very own intelligent agent or to work on their very own personal world improvement projects.
- Many cloud or other service provider organizations (with sufficient inferencing resources) could make it available as a service to be used by anyone on an incremental, OpEx cost basis.
The risks of a truly open source AGI are also many and include:
- Any bad actor, nation state, organization, billionaire, etc., could copy the AGI and train it as a weapon to eliminate their enemies or all of humankind, if so inclined.
- Any bad actors could use it to swamp the internet and world’s media with biased information, disinformation or propaganda.
- Any good actor or researcher, could, perhaps by mistake, unleash an AGI on an exponentially increasing, self-improvement cycle that could grow beyond our ability to control or to understand.
- An AGI agent alone could take it upon itself to eliminate humanity, or the world, as the best option to save itself.
But all these risks are even more of a problem for closed or semi-open/semi-closed releases of AGIs, as the only organizations with the resources to do LLM research are very large tech companies or large, technically competent nation states, all of which are already competing across the world stage.
The resources may still limit widespread use
One item that seems to be in the way of truly widely available AGI is the compute resources needed to train one or to use one for inferencing. OpenAI has Microsoft and other select organizations funding their compute; Meta and Google have all their advertising revenue funding theirs.
AI2 seems to have access (and is looking for more funding for even more access) to the EU’s LUMI supercomputer (an HPE Cray system using AMD EPYC CPUs and AMD Instinct GPUs), located in CSC’s data center in Finland and currently the EU’s fastest supercomputer at 375 CPU PFlops/550 GPU PFlops (~1.5M laptops).
Not many organizations, let alone nations could afford this level of compute.
But the funny thing is that compute (flops/$) doubles every 2 years or so. At that rate, in six years an equivalent of LUMI’s compute power would require only ~190K of today’s laptops, in twelve years ~23K, and in eighteen years ~3K laptops, something most nations and many organizations could probably afford. Stretch that out to ~36 years, or around 2060, and we are down to a handful of laptops, which just about anyone with a family in the modern world could afford. So in ~36 years, any of us could train an LLM on our family’s compute resources. And that’s just the training compute.
My guess is that something like 10-100X less compute would be required to use it for inferencing. So that level of compute is probably available to many organizations right now, or if not now, within 6 years or so.
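The back-of-the-envelope math above can be sketched in a few lines of Python. All the constants are assumptions taken from this post, not measurements: LUMI’s GPU partition at ~550 PFlops, an ordinary laptop at ~0.37 TFlops sustained (which is what makes LUMI ~1.5M laptop-equivalents), flops/$ doubling every 2 years, and inference needing 10-100X less compute than training.

```python
# Sketch of the compute-scaling argument; all constants are assumptions
# from the post (LUMI ~550 GPU PFlops, a laptop ~0.37 TFlops sustained,
# flops/$ doubling every ~2 years, inference 10-100x cheaper than training).

LUMI_PFLOPS = 550.0
LAPTOP_TFLOPS = 0.37
DOUBLING_PERIOD_YEARS = 2.0

# LUMI in laptop-equivalents today: ~1.5M laptops.
laptops_today = LUMI_PFLOPS * 1e15 / (LAPTOP_TFLOPS * 1e12)

def laptop_equivalents(years_from_now: float) -> float:
    """Laptops needed for LUMI-class training compute after N years,
    assuming compute per dollar doubles every DOUBLING_PERIOD_YEARS."""
    return laptops_today / 2 ** (years_from_now / DOUBLING_PERIOD_YEARS)

for years in (0, 6, 12, 18, 24, 30, 36):
    print(f"{years:2d} years out: ~{laptop_equivalents(years):,.0f} laptops")

# Inference at 10-100x less compute: divide any of the figures above.
low, high = laptops_today / 100, laptops_today / 10
print(f"inference today: ~{low:,.0f} to ~{high:,.0f} laptops")
```

Under these assumptions, LUMI-class training compute falls from ~1.5M laptop-equivalents today to roughly 3K at eighteen years out and single digits at around thirty-six years out; any of the numbers scale directly with whatever per-laptop throughput you assume.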
I can’t wait until I can have my very own AGI to use to write RayOnStorage current-correct-vetted blog posts for me…