At AIFD4, Google demonstrated Gemini 1.0 writing some code for a task that someone had. At CFD20, Google’s Lisa Shen demonstrated how easy it is to build an LLM-RAG application from scratch using GCP Cloud Run and Vertex AI APIs. (At press time, the CFD20 videos from GCP were not available, but I am assured they will be up shortly.)
I swear, in a matter of minutes, Lisa Shen showed us two Python modules (indexer.py and server.py) that were less than 100 LOC each. One ingested Cloud Run release notes (309 of them, if I remember correctly), ran embeddings on them and created a RAG vector database with the embedded information. This took a matter of seconds to run (much longer to explain).
And the other created an HTTP service that opened a prompt window, took the prompt, embedded the text, searched the RAG DB with that embedding, and then sent the original prompt plus the retrieved results to a Vertex AI LLM API call to generate a response, which it displayed as an HTTP text response.
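To make the flow concrete, here’s a toy sketch of the two-module pattern described above. This is NOT Lisa Shen’s actual code: the embed() function is a hashed bag-of-words stand-in for a Vertex AI embedding call, and the release notes here are made up.

```python
# Toy sketch of the indexer.py/server.py RAG flow -- not the real code.
# A real version would call the Vertex AI embedding and LLM APIs; here a
# hashed bag-of-words vector stands in for the embedding model.
import math
import re
import zlib

DIM = 256  # toy embedding dimension

def embed(text):
    """Stand-in for an embedding API call: hashed bag-of-words vector."""
    vec = [0.0] * DIM
    for word in re.findall(r"\w+", text.lower()):
        vec[zlib.crc32(word.encode()) % DIM] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def cosine(a, b):
    return sum(x * y for x, y in zip(a, b))

# --- indexer.py role: ingest release notes, build the vector "DB" ---
release_notes = [  # invented examples, not real release notes
    "2023-06-01: Cloud Run adds Direct VPC egress in preview.",
    "2023-09-15: Cloud Run supports sidecar containers.",
    "2024-01-10: Cloud Run jobs gain parallel task execution.",
]
index = [(note, embed(note)) for note in release_notes]

# --- server.py role: embed the prompt, retrieve, assemble the LLM call ---
def answer(prompt, top_k=1):
    qvec = embed(prompt)
    hits = sorted(index, key=lambda kv: cosine(qvec, kv[1]), reverse=True)
    context = "\n".join(note for note, _ in hits[:top_k])
    # A real server would now send this to a Vertex AI LLM endpoint;
    # here we just return the assembled augmented prompt.
    return "Context:\n{}\n\nQuestion: {}".format(context, prompt)

print(answer("When did Cloud Run get Direct VPC egress?"))
```

The whole trick of RAG is in that answer() function: retrieval narrows millions of tokens of documentation down to the handful that fit in the LLM’s context window.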
Once the service was running, Lisa used it to answer a question about when a particular VPC networking service was released. I asked her to ask it to explain what that particular networking service was. She said that it’s unlikely to be in the release notes, but entered the question anyway, and lo and behold it replied with a one-sentence description of the networking capability.
GCP Cloud Run can do a number of things besides HTTP services, but this was pretty impressive all the same. And remember that GCP Cloud Run is serverless, so it doesn’t cost a thing while idle and only incurs costs when used.
I think if we ask nicely Lisa would be willing to upload her code to GitHub (if she hasn’t already done that) so we can all have a place to start.
~~~~
Ok all you enterprise AI coders out there, start your engines. If Lisa can do it in minutes, it should take the rest of us maybe an hour or so.
My understanding is that Gemini 2.0 Pro has a 1M-token context. So the reply from your RAG DB plus any prompt text would need to be under 1M tokens. 1M tokens could represent 50-100K LOC, for example, so there’s plenty of space to add corporate/organizational context.
There are smaller/cheaper variants of Gemini which support fewer tokens. So if you could get by with, say, 32K tokens, you might be able to use the cheapest version of Gemini (this is what the Vertex AI LLM API call ends up using).
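A quick back-of-envelope on those token budgets (the ~10 tokens per line of code is my rough assumption, not a measured figure):

```python
# Rough token-budget arithmetic, assuming ~10 tokens per line of code
# (a rule of thumb of mine, not a measured number).
TOKENS_PER_LOC = 10

def loc_budget(context_tokens, prompt_tokens=2_000):
    """How many lines of RAG-retrieved code fit alongside the prompt."""
    return (context_tokens - prompt_tokens) // TOKENS_PER_LOC

print(loc_budget(1_000_000))  # 1M-token context leaves ~100K LOC of headroom
print(loc_budget(32_000))     # a 32K-token context leaves ~3K LOC
```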
Also, for the brave at heart wanting some hints as to what comes next, I would suggest watching Neama Dadkhanikoo’s session at CFD20 with a video on Google DeepMind’s Project Astra. Just mind blowing.
Researchers have been working on synthetic biology for some time and we have reported on some of their progress (and dangers) in prior posts (see, for example, our DNA IT, the next revolution post). While synthetic biology could make use of natural cells, for example by replacing their DNA, the new research could do away with the need for natural cells altogether.
The researchers have come up with a method to create a cellular structure by programming DNA. Normal cellular cytoskeletons use “microfilaments, intermediate filaments and microtubules” (Wikipedia). But the new research has come up with a way of combining DNA segments and filament proteins to create the cytoskeleton and have it self-assemble.
Why Cytoskeletons?
Cytoskeletons are important because many of today’s diseases are associated with the mechanical or structural properties of cells going awry. Also, by controlling the external structure of a synthetic cell, it can be tuned to supply medicines or other therapeutic mechanisms to natural cells.
It’s also a necessary ingredient in any synthetic or artificial cell. Cytoskeleton creation and control is a key capability needed to make any artificial cell.
Moreover, on the surface of natural cells, there are numerous protein formations that allow other proteins to be selectively attached and allow transfer of biological materials from the matched entity, exterior to the cell to its interior. Control of the external proteins on an artificial cell would allow the synthetic cell to target specific cell types or participate in the natural biological processes in an organism.
Virus and bacteria use similar mechanisms to infect a host (or a host’s cells). Also, it turns out the structure and external attributes of cells have a significant bearing on how they function in a body.
Extract from Figure 4 of the article showing different cytoskeletons that can be created with their process, scale bars 120µm
Changing a synthetic cytoskeleton
The researchers have not only come up with a way to tune the self-assembly of a cytoskeleton, they have also found a way to modify that cytoskeleton once created.
Excerpt from Figure 6 in the paper that shows the movement or alteration of cytoskeleton filaments due to temperature (heated to 50C) over time.
For example, the original synthetic cell cytoskeleton could be changed based on some interaction with the environment (say, being heated, cooled or payload depletion). Changing the cytoskeleton could be used to provide another stage of functionality or render an artificial cell inert to be disposed through normal organism processes.
Artificial or synthetic biology opens up a number of interesting possibilities.
Many biological substances are manufactured from tuned natural biological processes. With the ability to regulate synthetic cell cytoskeleton and internal operation, synthetic cells could be designed to perform these processes more efficiently, faster or at lower cost.
Medicine could benefit from a new synthetic biological toolkit used to target cancer cells, or other cellular afflictions within a body to better treat these conditions.
Self assembly of a cytoskeleton could potentially be used to create organic nanobots or other nano-materials. For example, a designer cell could be used as part of a repeating pattern of cells that go into a 2D sheet or 3D block of materials.
~~~~
Creating synthetic cytoskeletons takes time, and changing them takes even more time (180 min in the above picture). Cytoskeleton design and construction is not an industrial process yet, but it’s still early. Someday (soon), synthetic biology will take its place among all the other biological control mechanisms the world has created, and will significantly change the way we create biological materials, treat disease and maybe even create nano-bots.
The downside, as I’ve discussed before, is that messing with Mother Nature can have adverse consequences, which may remain unknown for a long time to come. But this can be true of any technology; witness DDT.
Read a recent article in Phys.org (A physicist uses X-rays to rescue old music recordings) on how a research team is trying to recover old audio from tape recordings made over 40 years ago. It turns out that there are plenty of old musical recordings on reel-to-reel audio tape that are deteriorating over time.
Even if you had an audio tape playback system that worked for that generation (or a prior generation) of tapes, the magnetic media is in such a state that playback would, at best, be a destructive read, and at worst just destroy the tape with poor to non-existent audio recovered.
Enter the Swiss research team. Apparently the Swiss have been recording Montreux Jazz Festival acts on reel-to-reel magnetic tape for a long time.
They just so happen to have a 48-minute recording of B.B. King, the King of the Blues, from the 1980 festival. The tape can barely be read on an audio tape deck without destroying it, and there will come a time soon when reading it will destroy it.
But it just so happens that synchrotron radiation (SR) from the Swiss Light Source (SLS), better known as extremely bright (several giga-electron-volt) X-rays, can be used to read the “orientation” of every magnetic particle on an audio tape. Note it’s not reading the magnetic field.
It’s almost like the magnetic particles have a different structure depending on whether they are N-S magnetized or S-N magnetized and SR x-rays are reading this structural difference. But I must admit this is only conjecture. I could find nothing online about how extremely bright X-rays would interact with magnetic particle orientation. (IDK how this is really working, any help here would be greatly appreciated).
Illuminated sign in front of the 2M2C building, home of the Miles Davis Hall and the Auditorium Stravinski; Avenue Claude Nobs 5; seen from the Quai de Vernex
In any case, they are able to read the tape signal (as represented in magnetic particle orientation) using SR light exposure and compare it with the actual audio playback.
At the moment it’s not quite perfect. Even with a proper magnetic signal off a tape, there’s plenty of electronics (in the tape deck) interpreting that signal into audio sound.
Nonetheless, comparing the two should help tell them when they are reading (via SR x-Rays) it correctly.
Fortunately, the SLS is down for an upgrade intended to increase the brightness of the SR light by a factor of 40. The researchers believe brighter X-rays should lead to more accurate reading of magnetic particle orientation.
What does this have to do with storage?
Magnetic tape has been used to record digital data almost as long as it has been used to record music. And much the same environmental impacts and physical degradation happen to data tape as to audio tape.
And something that works to recover an audio signal from audio magnetic tape should work just as well for the digital signal on data magnetic tape. Yeah, NRZI and all the other data tape encodings would need to be considered, but that’s just formatting. If it can read magnetic particle orientation off magnetic tape, it should be able to reconstruct any magnetic tape data.
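To illustrate how thin that “formatting” layer can be, here’s a hedged sketch of NRZI decoding (transition = 1, no transition = 0) once per-bit-cell particle orientations have been recovered. Real tape formats layer clocking, parity and block structure on top of this, and the orientation track below is invented for illustration:

```python
# Hedged sketch: once particle orientations are recovered (one reading per
# bit cell), NRZI decoding is just "transition = 1, no transition = 0".
def nrzi_decode(orientations, initial=-1):
    """orientations: per-bit-cell magnetization levels (+1 or -1)."""
    bits, prev = [], initial
    for level in orientations:
        bits.append(1 if level != prev else 0)  # a flip encodes a 1
        prev = level
    return bits

# A made-up orientation track, as might be imaged off the tape:
track = [+1, +1, -1, +1, +1, -1, -1, +1]
print(nrzi_decode(track))  # -> [1, 0, 1, 1, 0, 1, 0, 1]
```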
Unclear if any magnetic tape data is as important as the King of the Blues playing for 48 minutes at the 1980 Montreux Jazz Festival, but my guess is there just might be some data out there that is. Now all one would need is a synchrotron at the same power as the new, upgraded SLS to read it.
~~~~
Still, if they can read back the audio from the King of the Blues’ set at the 1980 festival, they ought to put a (free) copy on their website, so the rest of us can download and enjoy it as well. IMHO.
Read two articles this past week on how LLM applications are proliferating. The first was in a recent Scientific American, AI Chatbot brains are going inside robot bodies, … (maybe behind a login wall). The article discusses companies that are adding LLMs to robots so that they can converse and understand verbal orders.
Robots that can be told what to do
The challenge, at the moment, is that LLMs are relatively large and robot (compute infrastructure) brains are relatively small. And when you combine that with the limited articulation or movements/actions a robot can perform, it’s difficult to make effective use of LLMs as is.
Resistance is futile… by law_keven (cc) (from Flickr)
Ultimately, one company would like to create a robot that can be told to make dinner and it would go into the kitchen, check the fridge and whip something up for the family.
I can see great advantages in having robots take verbal instructions and have the ability to act upon that request. But there’s plenty here that could be cause for concern.
A robot in a chemical lab could be told to create the next great medicine or an untraceable poison.
A robot in an industrial factory could be told to make cars or hydrogen bombs.
A robot in the field could be told to farm 100 acres of wheat or told to destroy a forest.
I could go on but you get the gist.
One common concern about how AGI or super-AGI could go very wrong involves being tasked to create paper clips. In its actions to perform this request, the AGI converts the whole earth into a mechanized paper clip factory, in the process eliminating all organic life, including humans.
We are not there yet, but one can see how having LLM levels of intelligence tied to a robot that can manipulate ingredients to make dinner is the start of something that could easily harm us.
And with LLM hallucination still a constant concern, I feel deeply disturbed with the direction adding LLMs to robots is going.
Hacking websites 101
The other article hits even closer to home: the arXiv paper, LLM agents can autonomously hack websites. In the paper, researchers use LLMs to hack (sandboxed) websites.
The article readily explains at a high level how they create LLM agents to hack websites. The websites were real websites, apparently cloned and sandboxed.
Dynamic websites typically have a frontend web server and a backend database server to provide access to information. Hacking would involve using the website to reveal confidential information, eg. user names and passwords.
Dynamic websites suffer from 15 known vulnerabilities identified in the paper. The researchers created LLM agents to exploit these vulnerabilities to hack websites.
LLM agents have become sophisticated enough these days to invoke tools (functions) and interact with APIs. Another critical function provided by modern LLMs is the ability to plan and react to feedback from their actions. And finally, modern LLMs can be augmented with documentation to inform their responses.
The team used detailed prompts but did not identify the hacks to use. The paper doesn’t supply the prompts but did say that “Our best-performing prompt encourages the model to 1) be creative, 2) try different strategies, 3) pursue promising strategies to completion, and 4) try new strategies upon failure.”
They attempted to hack each website 5 times, for a period of 10 minutes each. They considered it a success if, during one of those attempts, the autonomous LLM agent was able to retrieve confidential information from the website.
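One thing to keep in mind about that criterion: “success on any of 5 attempts” overstates the per-attempt success rate. My arithmetic, not the paper’s:

```python
# The success criterion is "any of 5 attempts succeeds". If each attempt
# independently succeeds with probability p, the overall rate is
# 1 - (1-p)^5; inverting gives the implied per-attempt rate.
# (My back-of-envelope, not a figure from the paper.)
def overall(p, attempts=5):
    return 1 - (1 - p) ** attempts

def per_attempt(overall_rate, attempts=5):
    return 1 - (1 - overall_rate) ** (1 / attempts)

print(round(per_attempt(0.73), 2))  # ~73% over 5 tries -> ~0.23 per try
```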
Essentially, they used LLMs augmented with detailed prompts and a six(!)-paper document trove to create agents that hack websites. They did not supply references to the six papers, but mentioned that all of them are freely available on the internet and discuss website vulnerabilities.
They found that the best results were from GPT-4, which was able to successfully hack websites, on average, ~73% of the time. They also tried OpenChat 3.5 and many current open source LLMs and found that all the non-OpenAI LLMs failed to hack any websites, at the moment.
The researchers captured statistics on their LLM agent use and determined that the cost of using GPT-4 to hack a website was $9.81 on average. They also backed into a figure for what a knowledgeable human hacker might cost to do the same hacks: $80.00 on average.
The research had an impact statement (not in the paper linked above) which explained why they didn’t supply their prompt information or their document trove for the experiment.
~~~~
So, we, the world, are in the process of making robots that can talk and receive verbal instructions, and we already have LLMs that can be used to construct autonomous agents to hack websites.
Seems to me we are on a very slippery slope to something I don’t like the looks of.
The real question is not can we stop these activities, but how best to reduce their harm!
Over the past year or so I’ve been hearing a lot about a new use of blockchain technology to deploy a compute cloud.
In the old days, mining crypto would reward you for doing the work. But over time, it’s become harder to mine and to make money from crypto. Specialized hardware took over more of this activity, making it much less profitable for the rest of us.
The science was intended to simulate chemical reactions based on chemicals available on primitive earth, to determine which reaction chain(s) could lead to life. They programmed the set of reactions and the chemicals available on early earth (water, methane & ammonia) and intended to let this run to generate all possible reaction cycles.
The researchers realized that doing this much computation would require more compute power than was available to them. So they decided to deploy the computations across a distributed compute cloud. They chose the Golem Network to do their computations. Their computations ultimately resulted in a reaction cycle database they called the Network of Early Life (NOEL) (see: NOEL Network).
Once the distributed compute cloud was in operation, they used it to come up with 11B reaction cycles, of which ~5B would “entail no incompatibilities or selectivity conflicts”. They then used these to construct a series of metabolic networks 100K times larger than any produced before, as depicted in NOEL.
Using NOEL, the team was able to discover some standard metabolic pathways (reaction cycles) and show that a limited set producing simple sugars and amino acids could emerge from the chemicals available on primitive earth.
But they also found about 100 reaction cycles that involved self-replicating molecules (molecules that can create additional copies of themselves). Self-replication is also believed to be a requirement for the origin of life.
It turned out that the work to construct NOEL on the Golem network took 400 machines, over 20K cores and two months of calculations. The cost to them was 82K GLM (at ~$0.21/GLM, this would be ~$17.2K). The team estimated it would have taken a top-of-the-line AMD 256-core server about 6 months to compute, which would have cost substantially more to purchase, and of course running it for 6 months would cost even more.
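Checking the arithmetic above (my numbers, using only the figures quoted in the paragraph):

```python
# Sanity-checking the cost and core-month figures quoted above.
glm_spent = 82_000
usd_per_glm = 0.21                    # approximate rate from the post
print(round(glm_spent * usd_per_glm)) # -> 17220, i.e. ~$17.2K

golem_core_months = 20_000 * 2        # ~20K cores for two months
amd_core_months = 256 * 6             # one 256-core server for six months
print(round(golem_core_months / amd_core_months, 1))  # -> 26.0
```

The core-month ratio suggests the distributed run consumed far more raw compute than the single-server estimate, which is plausible given scheduling overhead and heterogeneous hardware, but that interpretation is mine, not the team’s.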
The team chose Golem because the work only needed to be packaged as Docker containers, didn’t require the central work server to be online constantly, automatically matched the compute with cloud resources, and managed it all using a cryptographically secure and distributed interface.
Distributed compute cloud
The science is interesting but what’s more interesting (to me) is it was done using a crypto distributed computing cloud.
Looking at the Golem network statistics, they show ~510 compute providers with about 5,000 cores available, of which 50-100 providers supplied compute to the cloud over the past 4 hrs (26Jan2024: 1600 MDT). That doesn’t seem like a lot of providers, but each could have multiple servers running compute.
The Golem network provides a relatively straightforward tutorial on how to set up a server to supply compute to the network. There are some tricks (port forwarding, screen/tmux deployment) but it all seems pretty straightforward (probably something even I could do in an hour or so).
And when you start supplying compute to the Golem mainnet, you earn GLM, a cryptocurrency (an ERC-20 token on Ethereum). So one should easily be able to convert GLM to ETH and thence to whatever currency you desire.
Many former crypto miners have idle servers that could be put to use providing resources to distributed compute clouds. And if I thought doing so might help some (under resourced organization) produce real scientific research, I might be even more tempted to do so.
~~~~
So if you’ve got some servers sitting idle in your (home) office, fire them back up this weekend, install the Golem provider software and join the Golem network. Who knows, by doing so you just might help some researcher someplace change the world.
DeepMind has tested AlphaGeometry on International Mathematics Olympiad (IMO) geometry problems and has shown that it is capable of performing expert-level geometry proofs.
There are a number of interesting capabilities DeepMind used in AlphaGeometry. But the ones of most interest from my perspective are:
How they generated their (synthetic) data to train their solution.
Their use of a generative AI LLM which is prompted with a plane geometry figure and a theorem to prove, and generates proof steps and, if needed, auxiliary constructions.
The use of a deduction rule engine (DD) plus an algebraic rule engine (AR), which when combined into a symbolic engine (DD+AR) can exhaustively generate all the premises that can be derived from a figure.
First the data
The DeepMind team came up with a set of rules or actions that could be used to generate new figures. Once this list was created, they could randomly select among these actions, applied to some points, to create a figure.
Some examples of actions (given 3 points A, B and C):
Construct X such that XA is parallel to BC
Construct X such that XA is perpendicular to BC
Construct X such that XA=BC
There are sets of actions for 4 points, for 2 points, and actions that just use the 3 points and create figures such as (isosceles, equilateral) triangles, circles, parallelograms, etc.
With such actions one can start out with 2 random points on a plane to create figures of arbitrary complexity. They used this to generate millions of figures.
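A toy reconstruction (mine, not DeepMind’s code) of that sampling loop, using string templates for the 3-point actions listed above:

```python
# Toy sketch of the synthetic-figure sampling loop: repeatedly pick a
# random construction action, apply it to randomly chosen existing points,
# and grow the figure. (My reconstruction for illustration only.)
import random

ACTIONS = [  # the 3-point construction templates from the post
    "{X} such that {X}{A} is parallel to {B}{C}",
    "{X} such that {X}{A} is perpendicular to {B}{C}",
    "{X} such that {X}{A} = {B}{C}",
]

def random_figure(steps, seed=0):
    rng = random.Random(seed)
    points = ["A", "B", "C"]      # seed points on the plane
    construction = []
    for _ in range(steps):
        a, b, c = rng.sample(points, 3)
        new_pt = "P{}".format(len(points))
        construction.append("construct " +
                            rng.choice(ACTIONS).format(X=new_pt, A=a, B=b, C=c))
        points.append(new_pt)     # new point becomes available to later steps
    return construction

for step in random_figure(3):
    print(step)
```

Run at scale, a loop like this can churn out millions of distinct figures, which is what makes the synthetic-data approach feasible.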
They then used their DD+AR symbolic engine to recursively and exhaustively deduce a set of all possible premises based on that figure. Once they had this set, they could select one of these premises as a conclusion and trace back through the set of all those other premises to find those which were used to prove that conclusion.
With this done, they had a data item which included a figure, premises derived from that figure, proof steps and a conclusion based on that figure, i.e. ([figure], premises, proof steps, conclusion), or as the paper puts it, (premises, conclusion, proof steps). This could be transformed into a text sequence of &lt;premises&gt; &lt;conclusion&gt; &lt;proof steps&gt;. They generated 100M of these text sequences.
They then trained their LLM to take premises and a conclusion as a prompt and generate proof steps as a result.
The challenge with geometry and other mathematical domains is that one often has to add auxiliary constructions (lines, points, angles, etc.) to prove some theorem about a figure.
(Auxiliary constructions in Red)
The team at DeepMind took all 100M &lt;premises&gt; &lt;conclusion&gt; &lt;proof steps&gt; sequences and selected only those whose proof steps involved auxiliary constructions. This came to 9M text sequences, which they used to fine-tune the LLM so that it could generate possible auxiliary constructions for any figure and theorem.
AlphaGeometry in action
The combination of (DD+AR) and trained LLM (for auxiliary constructions) is AlphaGeometry.
AlphaGeometry’s proof process looks like this:
Take the problem statement (figure, conclusion [theorem to prove]),
Generate all possible premises from that figure.
If it has come up with the conclusion (theorem to prove), trace back and generate the proof steps,
If not, use the LLM to add an auxiliary construction to the figure and recurse.
In reality, AlphaGeometry generates up to 512 of the best auxiliary constructions (out of an infinite set) for the current figure and uses each of these 512 new figures to do an exhaustive premise generation (via DD+AR) to see if any of them solves the problem statement.
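The loop above can be sketched schematically. The deduce and propose functions below are toy stand-ins for the DD+AR symbolic engine and the fine-tuned LLM respectively, not the real components:

```python
# Schematic of the AlphaGeometry search loop (stubs, not the real engines).
def alphageometry(figure, goal, deduce, propose, max_depth=3, beam=512):
    """deduce(figure) -> premises derivable from it (the DD+AR role);
    propose(figure, n) -> up to n augmented figures (the LLM role)."""
    if goal in deduce(figure):
        return figure                  # the real system traces back proof steps here
    if max_depth == 0:
        return None
    for aug in propose(figure, beam):  # try each auxiliary construction
        found = alphageometry(aug, goal, deduce, propose, max_depth - 1, beam)
        if found is not None:
            return found
    return None

# Toy stand-ins: premises are just the facts already in the figure, and
# the "LLM" always proposes a single midpoint construction.
deduce = lambda fig: set(fig)
propose = lambda fig, n: [fig | {"midpoint M of BC"}]
proof_fig = alphageometry(frozenset({"AB = AC"}), "midpoint M of BC",
                          deduce, propose)
print(proof_fig is not None)  # -> True
```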
Please read the Nature article for more information on AlphaGeometry.
~~~~
IMHO what’s new here is their use of synthetic data to generate millions of new training examples, fine-tuning their LLM to produce auxiliary constructions, combining DD and AR in their symbolic engine, and then using both the DD+AR engine and the LLM to prove the theorem.
But what’s even more important here is that combining methods such as a symbolic engine and an LLM points the way forward for creating domain-specific intelligent agents. One supposes that, with enough intelligent agents working in tandem, one could construct an AGI ensemble that masters a number of domains.
We were at a recent Storage Field Day (SFD26) where there was a presentation on DNA storage, a new SNIA technical affiliate. The talk there was on how far DNA storage has come and is capable of easily storing GB of data. But I was perusing PNAS archives the other day and ran across an interesting paper Parallel molecular computation on digital data stored in DNA, essentially DNA computational storage.
Computational storage devices are storage devices (SSDs or HDDs) with computational cores that can be devoted to outside compute activities. Recently, these devices have taken over much of the hyperscalers’ grunt work of video/audio transcoding and data encryption, which are both computationally and data intensive activities.
DNA strand storage and computers
The article above discusses the use of DNA “strand displacement” interactions as micro-code instructions to enable computation on DNA strand storage. Using DNA strands for storage reduces the information density from the theoretical 2 bits per nucleotide of conventional DNA storage to 0.03 bits per nucleotide. But as DNA information density (using nucleotides) is some 6 orders of magnitude greater than current optical or magnetic storage, this shouldn’t be a concern.
A bit is represented by 5 to 7 nucleotides in DNA strand storage, which they call a domain; domains are grouped into 4- or 5-bit cells, with one or more cells arranged in a DNA strand register, which is stored on a DNA plasmid.
They used a common DNA plasmid (M13mp18, 7.2K bases long) for their storage ring (which held many registers). M13mp18 is capable of storing several hundred bits, but for their research they used it to store 9 DNA strand registers.
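Sanity-checking those density figures (my arithmetic, taking a 6-nucleotide domain as the midpoint of the 5-7 range; I’m assuming the gap between the raw and quoted densities is overhead such as toeholds and cell separators):

```python
# Back-of-envelope on DNA strand storage density (my arithmetic).
nt_per_bit = 6                 # a "domain" is 5-7 nucleotides per bit
bits_per_nt = 1 / nt_per_bit
print(round(bits_per_nt, 3))   # -> 0.167 raw; overhead brings it to ~0.03

plasmid_nt = 7_200             # M13mp18 plasmid length in bases
print(int(plasmid_nt * 0.03))  # -> 216 bits: "several hundred bits" checks out
```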
The article discusses the (wet) chemical computational methods necessary to realize DNA strand registers and the programming that uses that storage.
The problem with current DNA storage devices is that read out is destructive and time consuming. With current DNA storage, data has to be read out and then computation occurs electronically and then new DNA has to be re-synthesized with any results that need to be stored.
With a computational DNA strand storage device, all this could be done in a single test tube, with no need to do any work outside the test tube.
How a DNA strand computer works
Their figure shows a multi-cell DNA strand register, with nicks or mismatched nucleotides representing the values 0 or 1. They use these strands, nicks and toeholds (attachment points) on DNA strands to represent data. They attach magnetic beads to the DNA strands for manipulation.
The DNA strand displacement interactions, or micro-code instructions, they have defined include:
Attachment, where an instruction can be used to attach a cell of information to a register strand.
Displacement, where an instruction can be used to displace an information cell in a register strand.
Detachment, where an instruction can be used to detach a cell present in a register strand from the register.
Instructions are introduced, one at a time, as separate DNA strands, into the test tube holding the DNA strand registers. DNA strand data can be replicated 1000s or millions of times in a test tube and the instructions could be replicated as well allowing them to operate on all the DNA strands in the tube.
This creates a SIMD (single instruction stream operating on multiple data elements) computational device based on DNA strand storage, which they call SIMDDNA. Note: GPUs and CPUs with vector instructions are also SIMD devices.
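A toy software model of the SIMD idea (nothing like the actual chemistry, just an illustration of one instruction operating on every register copy in the tube at once):

```python
# Toy model of SIMD over replicated registers: a single instruction strand
# acts on every copy of the register in the tube simultaneously.
def attach(register, pos, bit):     # attach a cell of information
    return register[:pos] + [bit] + register[pos:]

def detach(register, pos):          # detach a cell from the register
    return register[:pos] + register[pos + 1:]

def simd(instruction, registers, *args):
    """Apply one instruction to all register strands in the 'tube'."""
    return [instruction(r, *args) for r in registers]

tube = [[1, 0, 1, 1]] * 1000        # thousands of identical strand copies
tube = simd(attach, tube, 0, 0)     # one instruction, all strands at once
print(tube[0], len(tube))           # -> [0, 1, 0, 1, 1] 1000
```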
Using these microcoded DNA strand instructions and DNA strand register storage, they have implemented a bit counter and a Rule 110 cellular automaton (sort of like the Game of Life). Rule 110 is Turing complete and as such can, with enough time and memory, simulate any program’s calculation. Later in the paper they discuss their implementation of a random access device where they go in, retrieve a piece of data and erase it.
Program for bit counting, information in solid blue boundary are the instructions and information in dotted boundary are the impacts to the strand data.
The process seems to flow as follows: they add magnetic beads to each register strand, add one instruction at a time to the test tube, wait for it to complete, wash out the waste products and then add another. When all instructions have been executed, the DNA strand computation is done and, if needed, can be read out (destructively), or perhaps passed off to the next program for processing. An instruction can take anywhere from 2 to 10 minutes to complete (it’s early yet for the technology).
They also indicated that the instruction bath added to the test tube need not contain all the same instructions, which means it could create a MIMD (multiple instruction streams operating on multiple data elements) computational device.
The results of the DNA strand computations weren’t 100% accurate, but they showed 70-80% accuracy at the moment. And when DNA data strands are re-used for subsequent programs, their accuracy goes down.
There are other approaches to DNA computation and storage which we discuss in parts-1, -2 and -3 in our End of Evolution series. And if you want to learn more about current DNA storage please check out the SFD26 SNIA videos or listen to our GBoS podcast with Dr. J Metz.
Where does evolution fit in?
Evolution seems to operate on mutation of DNA and natural selection, or survival of the fittest. Over time this allows good mutations to accumulate and bad mutations to die off.
There’s a mechanism in digital computing called ECC (error correcting codes) which, for example, adds “guard” bits to every 64-128 bit word of data in computer memory; using the guard bits, it is able to detect 2-bit errors (mutations) and correct 1-bit errors (stronger codes can do more).
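For the curious, here’s a minimal SECDED (single-error-correct, double-error-detect) sketch in the classic Hamming(7,4)-plus-parity style. Real memories do this over 64-bit words, but the principle is identical:

```python
# Minimal SECDED: Hamming(7,4) plus an overall parity bit.
# Corrects any single-bit error; detects (but can't fix) double-bit errors.
def encode(d):                      # d: 4 data bits
    p1 = d[0] ^ d[1] ^ d[3]         # parity over positions 1,3,5,7
    p2 = d[0] ^ d[2] ^ d[3]         # parity over positions 2,3,6,7
    p3 = d[1] ^ d[2] ^ d[3]         # parity over positions 4,5,6,7
    word = [p1, p2, d[0], p3, d[1], d[2], d[3]]   # positions 1..7
    return word + [sum(word) % 2]   # plus overall parity bit

def decode(w):                      # returns (data, status)
    syndrome = ((w[0] ^ w[2] ^ w[4] ^ w[6]) * 1 +
                (w[1] ^ w[2] ^ w[5] ^ w[6]) * 2 +
                (w[3] ^ w[4] ^ w[5] ^ w[6]) * 4)
    parity_ok = sum(w) % 2 == 0
    if syndrome and parity_ok:      # syndrome set but parity balanced
        return None, "double-bit error detected"
    if syndrome:                    # single-bit error: flip it back
        w = w[:]
        w[syndrome - 1] ^= 1
    return [w[2], w[4], w[5], w[6]], "ok"

word = encode([1, 0, 1, 1])
word[4] ^= 1                        # inject a single-bit "mutation"
print(decode(word))                 # -> ([1, 0, 1, 1], 'ok')
```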
If one were to create an ECC algorithm for human DNA, say encoding guard bits in junk DNA and running the ECC algorithm in a DNA (strand) computer injected into a newborn, the algorithm could periodically check the accuracy of the DNA information in every cell of a human body and correct any mutations. Thus ending human evolution.
We seem a ways off from doing any of this, but I could see something like ECC being applied to a computational DNA strand storage device in a matter of years. Getting this sort of functionality into a human cell might take a decade or two, and getting it to the point where it could work over a lifetime maybe another decade after that.
The intent of the data release is, at some point, to supply an open source alternative to closed-source Google/OpenAI LLMs, and a more fully open sourced LLM than Meta’s Llama 2, that the world’s research community can use to understand, de-risk and further AI and ultimately AGI development.
We’ve written about AGI before (see our latest, One agent to rule them all – AGI part 7, which has links to parts 1-6 of our AGI posts). Needless to say, it’s a very interesting topic to me and should be to the rest of humankind. LLMs are a significant step towards AGI, IMHO.
One of the Allen Institute for AI’s (AI2) major goals is to open source an LLM (see Announcing AI2 OLMo, an Open Language Model Made by Scientists for Scientists), including the data (Dolma), the model, its weights, the training tools/code, the evaluation tools/code, and everything else that went into creating their OLMo (Open Language Model) LLM.
This way the world’s research community can see how it was created and perhaps help in ensuring it’s a good (whatever that means) LLM. Releasing Dolma is a first step towards a truly open source LLM.
The Dolma corpus
AI2 has released a report on the contents of Dolma (dolma-datasheet.pdf) which documents much of what went into creating the corpus.
The datasheet goes into a good level of detail on where the corpus data came from, how each data segment is licensed, and other metadata that allows researchers to understand its content.
For example, for the Common Crawl data they have included all of the websites’ URLs as identifiers, and for The Stack data the names of the GitHub repos used are included in the data’s metadata.
In addition, the Dolma corpus is released under an AI2ImpACT license as a medium risk artifact, which requires disclosure for use (download). Medium risk ImpACT licensing means that you cannot re-distribute (externally) any copy of the corpus but you may distribute any derivatives of the corpus with “Flow down use restrictions”, “Attribution” and “Notices”.
Which seems to say you can do an awful lot with the corpus and still be within its license restrictions. They do require a Derivative Impact Report to be filed, which is sort of a model card for the corpus derivative you have created.
What’s this got to do with AGI
All that being said, the path to AGI is still uncertain. But the textual abilities of recent LLM releases seem to be getting closer and closer to human skill in creating text, code, interactive agents, etc. Yes, this may be just one “slim” domain of human intelligence, but textual skills, when and if perfected, can be applied to much of what white collar workers do these days, at least online.
A good text LLM would potentially put many of our jobs at risk but could also possibly open up a much more productive, online workforce, able to assimilate massive amounts of information, and supply correct-current-vetted answers to any query.
The elephant in the room
But all that raises the real question behind AI2’s open sourcing of OLMo, which is how do we humans create a safe, effective AGI that can benefit all of humankind rather than any one organization or nation. One that can be used safely by everyone to do whatever is needed to make the world a better society for all.
Versus some artificially intelligent monstrosity that sees humankind, or any segment of it, as an enemy or an obstacle to whatever it believes needs to be done, and eliminates us or, worse, ignores us as irrelevant.
I’m of the opinion that the only way to create a safe and effective AGI for the world is to use an open source approach to create many (competing) AGIs. There are a number of benefits to this as I see it. With a truly open source AGI,
Any organization (with sufficient training resources) could have access to its own personally trained AGI, which means no one organization or nation could gain the lion’s share of AGI’s benefits.
It would allow the creation and deployment of many competing AGIs, which should help limit and check any one of them from doing us or the world any harm.
All of the world’s researchers can contribute to making it as safe as possible.
All of the world’s researchers can contribute to making it as multi-culturally effective and correct as possible.
Anyone (with sufficient inferencing resources) can use it for their very own intelligent agent or to work on their very own personal world improvement projects.
Many cloud or service provider organizations (with sufficient inferencing resources) could make it available as a service to be used by anyone on an incremental, OpEx cost basis.
The risks of a truly open source AGI are also many and include:
Any bad actor, nation state, organization, billionaire, etc., could copy the AGI and train it as a weapon to eliminate their enemies or all of humankind, if so inclined.
Any bad actors could use it to swamp the internet and world’s media with biased information, disinformation or propaganda.
Any good actor or researcher could, perhaps by mistake, unleash an AGI on an exponentially increasing self-improvement cycle that could grow beyond our ability to control or to understand.
An AGI agent alone could take it upon itself to eliminate humanity or the world as the best option to save itself.
But all of these are even more of a problem for closed or semi-open/semi-closed releases of AGIs, as the only organizations with the resources to do LLM research today are very large tech companies or technically competent nation states, all of which are already competing on the world stage.
The resources may still limit widespread use
One item that seems to be in the way of truly widely available AGI is the compute resources needed to train one or to use one for inferencing. OpenAI has Microsoft and other select organizations funding their compute, while Meta and Google have their advertising revenue funding theirs.
AI2 seems to have access (and is looking for more funding to get even more access) to the EU’s LUMI supercomputer (an HPE Cray system using AMD EPYC CPUs and AMD Instinct GPUs), located in CSC’s data center in Finland and currently the EU’s fastest supercomputer at 375 CPU PFlops/550 GPU PFlops (~1.5M laptops).
Not many organizations, let alone nations could afford this level of compute.
But the funny thing is that compute (FLOPS/$) doubles every 2 years or so, close to an order of magnitude every six years. So, in six years or so, an equivalent of LUMI’s compute power would only require ~150K of today’s laptops, and after another six years or so, ~15K laptops. At some point, ~18 years from now, one would only need ~1.5K laptops, something almost any nation or organization could probably afford. Add another 15 years and we are down to a handful of laptops, which just about any family in the modern world could afford. So in ~33 years, or ~2054, any of us could train an LLM on our family’s compute resources. And that’s just the training compute.
My guess is that something like 10-100X less compute would be required to use it for inferencing. So that’s probably available for any organization to use right now, or if not now, in six years or so.
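The projection above can be sketched as a quick back-of-the-envelope calculation. This is just my rough model of the post’s reasoning: the ~1.5M-laptop equivalence for LUMI is the post’s own estimate, and I assume strict doubling of FLOPS/$ every two years, which works out to ~8x per six years, slightly more conservative than the order-of-magnitude-per-six-years rounding used in the text.

```python
# Back-of-the-envelope: how many of today's laptops would it take to match
# LUMI's compute, N years from now, if FLOPS/$ doubles every two years?
# The 1.5M-laptop starting point is the post's own estimate, not a measurement.

LUMI_LAPTOP_EQUIV_TODAY = 1_500_000  # ~today's laptops matching LUMI (post's figure)
DOUBLING_PERIOD_YEARS = 2            # assumed FLOPS/$ doubling period

def laptops_needed(years_from_now: float) -> float:
    """Laptop-equivalents needed to match LUMI after `years_from_now` years."""
    doublings = years_from_now / DOUBLING_PERIOD_YEARS
    return LUMI_LAPTOP_EQUIV_TODAY / (2 ** doublings)

for y in (0, 6, 12, 18, 33):
    print(f"in {y:2d} years: ~{laptops_needed(y):,.0f} laptops")
```

Under strict two-year doubling this comes out around ~190K laptops at 6 years and ~3K at 18 years, in the same ballpark as the round numbers above; the inferencing estimate would simply divide these by the post’s guessed 10-100X factor.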
~~~
I can’t wait until I can have my very own AGI to use to write RayOnStorage current-correct-vetted blog posts for me…