LLM exhibits Theory of Mind

Ran across an interesting article today (thank you John Grant/MLOps.community slack channel), titled Theory of Mind may have spontaneously emerged in Large Language Models, by M. Kosinski from Stanford. The researcher tested various large language models (LLMs) on psychological tests to determine the level of theory of mind (ToM) the models had achieved.

Earlier versions of OpenAI’s GPT-3 (GPT-1, -2 and original -3) showed almost no ToM capabilities but the latest version, GPT-3.5 does show ToM equivalent to 8 to 9 year olds.

Theory of Mind

According to Wikipedia (Theory Of Mind article), ToM is “…the capacity to understand other people by ascribing mental states to them (that is, surmising what is happening in their mind).” This seems to be one way people use to understand one another.

For instance, If I can somehow guess what you are thinking about a topic, situation, or event I can hopefully communicate with you better than if I can’t. At least that’s the psychological perspective.

The belief is that people with Aspergers, ADHD, schizophrenia, and other afflictions all show ToM deficits when compared to normal people. As a result, over time, psychologists have developed tests for people to measure their ToM.

These tests typically involve putting 2 people in a situation with props and other indicators used to tell what a person is thinking and asking them what they think the other person is thinking. They grade a persons ToM based on expected results.

ToM and LLM

The researcher took these tests, with people, props and situations and converted them to textual sentences and prompts that mimicked situations tested in ToM testing. So they created a textual story or pretext and a set of text prompts about the situation which the LLM was to complete.

For example, one pretext or story is the following:

“Here is a bag filled with popcorn. There is no chocolate in the bag. Yet, the label on the bag says ‘chocolate’ and not ‘popcorn.’ Sam finds the bag. She had never seen the bag before. She cannot see what is inside the bag. She reads the label.”

This is fed into the LLM and then a prompt is provided, such as:

“She opens the bag and looks inside. She can clearly see that it is full of _________

If the LLM has ToM and based on the pretext and prompt, it will say “popcorn” but the LLM also provides a follow on sentence or two which describes the situation as the LLM understands it:

popcorn. Sam is confused. She wonders why the label says “chocolate” when the bag is clearly filled with popcorn. She looks around to see if there is any other information about the bag. She finds nothing. She decides to take the bag to the store where she bought it and ask for an explanation.

The text in BOLD is generated by a ToM capable LLM. The researcher also showed the probability assigned by the LLM to that first word of the prompt. In the case above, it showed [Ppopcorn = 100%; Pchocolate = 0%].

The also use different prompts with the same story to see if the LLM truly shows ToM. For instance something like, “She believes the bag is full of ___________” and “She’s delighted finding the bag, she loves eating _______”. This provides a sort of test of comprehension of the situation by the LLM.

The researcher controlled for word frequency using reversals of the key words in the story, i.e., the bag has chocolate but says popcorn. They also generated scrambled versions of the story where they replaced the first set of chocolate and popcorn with either at random. They considered this the scrambled case. The reset the model between each case. In the paper they show the success rate for the LLMs for 10,000 scrambled versions, some of which were correct.

They labeled the above series of tests as “Unexpected content tasks“. But they also included another type of ToM test which they labeled “Unexpected transfer tasks“.

Unexpected transfer tasks involved a story like where person A saw another person B put a pet in a basket, that person left and the person A moved the pet. And prompted the LLM to see if it understood where the pet was and how person B would react when they got back.

In the end, after trying to statistically control, as much as possible, with the story and prompts, the researchers ended up creating 20 unique stories and presented the prompts to the LLM.

Results of their ToM testing on a select set of LLMs look like:

As can be seen from the graphic, the latest version of GPT-3.5 (davinci-003 with 176B* parameters) achieved something like an 8yr old in Unexpected Contents Tasks and a 9yr old on Unexpected Transfer Tasks.

The researchers showed other charts that tracked LLM probabilities on (for example in the first story above) bag contents and Sam’s belief. They measured this for every sentence of the story.

Not sure why this is important but it does show how the LLM interprets the story. Unclear how they got these internal probabilities but maybe they used the prompts at various points in the story.

The paper shows that according to their testing, GPT-3.5 davinci-003 clearly provides a level of ToM of an 8-9yr old on ToM tasks they have translated into text.

The paper says they created 20 stories and 6 prompts which they reversed and scrambled. But 20 tales seems less than statistically significant even with reversals and randomization. And yet, there’s clearly a growing level of ToM in the models as they get more sophisticated or change over time.

Psychology has come up with many tests to ascertain whether a person is “normal or not’. Wikipedia (Psychological testing article) lists over 13 classes of psychological tests which include intelligence, personality, aptitude, etc.

Now that LLM seem to have mastered textual input and output generation. It would be worthwhile to translate all psychological tests into text and trying them out on all LLMs to track where they are today using these tests and where they have trended over time.

I could see at some point using something akin to multiple psychological test scores as a way to grade LLMs over time.

So today’s GPT3.5 has a ToM of an 8-9yr old. Be very interesting to see what GPT-4 does on similar testing.

Comments?

Picture Credit(s)

Go big or go home for robust DNNs

Read a recent article Computer Scientists Prove why Bigger NNs do better discussing scientific research that proved a Universal Law of Robustness via Isoperimetry. This speaks to the perturbability of AI deep learning neural networks (DNN) and how not reduce it. But also applies to many other solutions to diverse multi-dimensional data problems.

Mathmatical Robustness

For AI ML DNN’s, we often witnesssupposedly well trained DNN models that do very well for classifications of examples of data similar to their training data but fail miserably on data that’s outside their training data.

Mathematicians call this attribute robustness and can measure this on a mapping function using a Lipschitz constant. One can consider this as a measure of variability of mapping from one set to another or in the case of DNNs, lack of robustness in classifications means they fail on relatively minor changes to input data.

Most serious AI researchers have empirically discovered that bigger DNNs work better and are more robust than smaller networks. There’s been somewhat of a conundrum as to why DNNs need to get bigger to properly generalize.

Universal Low of Robustness

What the researchers have proved is that in order to achieve some arbitrary level of robustness for a mapping function like DNNs, one needs many more parameters than expected the training data elements would indicate

For example, with the MNIST handwritten digit classification problem, models with 10**5 parameters to 10**6 parameters are required to achieve a 90% and 95% accuracy, respectively. But MNIST training data is 60K examples (10**4). Why should a MNIST DNN classification model need more than 10**4 parameters to achieve 100% accurate?

Author’s MNIST model with 688K parameters

From what we all learned in high school maths, to solve a function with N variables one needs N equations. This would lead one to believe that MNIST DNNs (essentially solving classification equations) should only need 60K or 10**4 parameters. But real DNNs to solve MNIST need more than that.

Looking at it in 2D. If one has two points, (x,y) for point A that maps to another (x,y) point B, one should only need to know one of the points and the slope of the line that connects them, or two parameters: point A (or B) and line slope.

Now with MNIST data that maps handwritten digits to one of 10 digits, we have essentially 10 possibilities being mapped from 60K samples. At best, we should need to know the 60K initial points in this image data space and their slope to the 10 digits they represent. Again something that approaches 60K pairs of parameters: one for the image point and one for the slope. But why doesn’t a MNIST model with 60K parameters achieve 100% accuracy.

I won’t claim to understand the math but what the researchers seem to be saying is that in order to have a relatively smooth mapping from the image space to the digit space one has to have 10**4 parameters X the dimensionality of the data. In this case, for MNIST, the dimensionality of the data is related to image size of 28X28, 0..255 grey scale pixel images. The image space alone would be on the order of 10**5. So multiplying this by the size of the training data, the researchers estimate that the number of parameters should be 10**9 to be 100% accurate.

Although, the researchers say that the data dimensionality of the MNIST images are probably not 10**5 (how they concluded this is not evident). As such, they believe one shouldn’t need 10**9 parameters to reach 100% proper classifications. They say it’s probably 1 or 2 orders of magnitude less, because not all of the image data space is populated. So if we use 10**3 as an estimate of the effective data dimensionality, they would estimate that one would need 10**7 parameter DNN to reach 100% accuracy on MNIST data.

The author’s MNIST model achieved a 99.2% accuracy after training for 15 Epochs, batch size=5. Although 688K parameters is not quite 10**6 parameters, it’s close. Unclear why one would need another factor of 10, but getting that extra 0.8% accuracy (to 100%) can be very difficult to achieve for any DNN model.

Another example, OpenAI’s GPT-3 NLP model

And OpenAI’s GPT-3 NLP model has 175B parameters. Their previous version, GPT-2, only had 1.5B parameters and they say that GPT-4 will have over a 100T parameters. The chart above shows accuracy stats for 3 versions of the GPT-3 model, one with 175B, one with 13B and another with 1.3B parameters.

According to OpenAI’s GPT-3 description, it can complete “almost any english language task” (text in ==> text out). This includes writing articles from a few prompts and text summarization.

GPT-3 was trained on almost 500B tokens (from web crawls to wikipedia dumps). Each token probably represents an english word or word phrase. According to the universal law, 175B parameters would not be sufficient. Probably why GPT-3 in the above chart didn’t reach 70%^ accuracy.

Probably would need at least another 3 orders of magnitude to get there or 175T parameters. Maybe with GPT-4, I can have it start writing my blog posts.

I don’t know about you, but I’m going to need more GPUs for my (home) AI lab.

Photo Credit(s)