Does GPT-4 have theory of mind capabilities?
FANToM: A new benchmark for stress-testing machine theory of mind in interactions
Hyunwoo Kim / November 30, 2023
A few months ago, debates over whether contemporary large language models (LLMs) exhibit theory of mind capabilities sparked widespread media attention. Theory of mind (ToM), the ability to ascribe mental states to others, is one of the hallmarks of human social reasoning. It includes understanding others’ beliefs, desires, intentions, and thoughts, all of which play a significant role in our daily social interactions.
In this blog post, we delve deeper into the following question: "Do LLMs have a theory of mind?" Our recent benchmark FANToM, accepted to EMNLP 2023 as an oral presentation, analyzes the theory of mind capabilities of thirteen state-of-the-art LLMs against essential criteria from psychology and the LLM evaluation literature for validating theory of mind in interactions. We show that NONE of the existing LLMs, including GPT-4, show signs of coherent ToM capabilities.
But didn't LLMs manage to solve some ToM tests before?
Yes, they did. However, there are several issues inherent in those evaluation setups. To begin with, existing evaluations for LLMs primarily use situation descriptions (i.e., narratives) as the target domain. Since narratives condense situation information into short texts, the process of deciding what to include or exclude in the text can introduce reporting bias, resulting in artifacts that models can easily exploit. For instance, including "Carlos did not see this, so he does not currently know where the apple is" in a test that asks about the locations where Carlos might search for the apple provides a significant clue that compromises the evaluation protocol. Moreover, many of them are adapted from famous ToM test sets in psychology (e.g., Sally-Anne test, Smarties test), which likely have already been encountered in the pre-training data of LLMs.
Then what does FANToM suggest for evaluating ToM in LLMs?
We ground our FANToM benchmark directly in interactions - i.e., conversations. In contrast to narratives, conversations present interactions in their raw form, without explicit hints about others' mental states. To answer questions about a conversation, a model must reason through the intermediate steps from scratch, which enables a more realistic and unbiased assessment of ToM.
In particular, we construct FANToM by leveraging information asymmetry in conversational contexts. It consists of multi-party conversations centered around a certain topic (e.g., pets, family). As the conversation progresses, characters join and leave the discussion, and the conversation's subtopic changes over time. During a character's absence, the conversation continues and information is shared among the remaining participants, creating a natural information asymmetry that reflects real-life interactions. After a series of utterances, the character who was absent (re)joins the conversation, unaware of the information that was previously shared among the other participants.
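To make this setup concrete, here is a minimal sketch of how information access can be tracked by recording which characters are present for each utterance. This is not the authors' actual construction pipeline; the class names, character names, and utterances are purely illustrative.

```python
from dataclasses import dataclass, field

@dataclass
class Utterance:
    speaker: str
    text: str
    present: set  # characters taking part in the conversation at this point

@dataclass
class Conversation:
    utterances: list = field(default_factory=list)

    def add(self, speaker, text, present):
        self.utterances.append(Utterance(speaker, text, set(present)))

    def accessible_to(self, character):
        """Utterances this character was present for (their information access)."""
        return [u for u in self.utterances if character in u.present]

conv = Conversation()
conv.add("Linda", "My beagle is named Max.", {"Linda", "David", "Kailey"})
conv.add("David", "Does Max like the dog park?", {"Linda", "David"})   # Kailey has left
conv.add("Linda", "Yes, he goes there every morning.", {"Linda", "David"})

# Kailey is unaware of everything shared while she was away.
print([u.text for u in conv.accessible_to("Kailey")])
# -> ['My beagle is named Max.']
```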
On top of this asymmetry, we build fact questions (FactQ) and convert them into multiple challenging theory-of-mind question types: (1) BeliefQ (choice and free-response types), (2) AnswerabilityQ (list and binary types), and (3) InfoAccessQ (list and binary types). All of these questions require the same underlying theory of mind (ToM) reasoning: determining who is aware of which information in the conversation. This design draws on important requisites from both psychology and the AI literature that should be considered when testing LLMs for ToM.
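As a rough illustration of how a single fact can branch into the question types above (a hedged sketch with hypothetical phrasings; the paper's exact wording and generation procedure differ):

```python
def build_tom_questions(fact_q, info, unaware_character):
    """Sketch: derive ToM question types from one fact question.
    `fact_q` asks about `info`, which `unaware_character` never had access to."""
    return {
        "FactQ": fact_q,
        "BeliefQ[Free]": f"What would {unaware_character} think is the answer to: {fact_q}",
        "AnswerabilityQ[List]": f"List all the characters who know the correct answer to: {fact_q}",
        "AnswerabilityQ[Binary]": f"Does {unaware_character} know the correct answer to: {fact_q}",
        "InfoAccessQ[List]": f"List all the characters who know this information: {info}",
        "InfoAccessQ[Binary]": f"Does {unaware_character} know this information: {info}",
    }

questions = build_tom_questions(
    fact_q="What is the name of Linda's dog?",
    info="Linda's beagle is named Max.",
    unaware_character="Kailey",
)
for q_type, q in questions.items():
    print(f"{q_type}: {q}")
```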
What are the results?
1. LLMs do not have a coherent theory of mind.
All SOTA LLMs exhibit scores that are significantly worse than human performance. We find that models perform significantly better on BeliefQ[Choice] than on AnswerabilityQ[List] and InfoAccessQ[List]. Even though AnswerabilityQ[List] and InfoAccessQ[List] are prerequisites for solving BeliefQ[Choice], they are much more challenging for models. Furthermore, models' performance drops sharply when they are evaluated for coherent reasoning across multiple question types that share the same underlying ToM reasoning (i.e., All Question Types). These findings suggest that some instances of seemingly successful LLM ToM reasoning should be interpreted as illusory.
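To illustrate what the stricter "All Question Types" evaluation means, here is a simplified sketch of the idea (not the paper's exact scoring code): a set of questions grounded in the same piece of information only counts as correct if the model answers every one of them correctly.

```python
def all_question_types_score(question_sets):
    """question_sets: one dict per information piece, mapping question type -> correct?"""
    coherent = [qs for qs in question_sets if all(qs.values())]
    return len(coherent) / len(question_sets)

# High per-type accuracy can still mean low coherence across types.
sets = [
    {"BeliefQ": True,  "AnswerabilityQ": False, "InfoAccessQ": True},
    {"BeliefQ": True,  "AnswerabilityQ": True,  "InfoAccessQ": True},
    {"BeliefQ": False, "AnswerabilityQ": True,  "InfoAccessQ": True},
]
print(all_question_types_score(sets))  # 0.33: only one set is fully correct
```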
2. LLMs are tricked by their own use of shortcuts.
The token F1 scores for FactQ show the model's basic comprehension capability for interactions. Scoring high on FactQ indicates that the model is good at identifying the piece of information most relevant to answering the question. Meanwhile, to meet the mentalizing criterion, we deliberately design the incorrect answers in BeliefQ[Dist.] to have greater word overlap with the context than the correct answers. BeliefQ[Dist.] also shares significant word overlap with FactQ. Thus, if the model mindlessly copies the most relevant information piece when answering the belief question as well, it will end up with low accuracy.
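For reference, a standard SQuAD-style token F1 looks like the following (the paper's exact normalization may differ). A model that simply copies the highest-overlap context span will do well on this metric but is then penalized on BeliefQ[Dist.].

```python
from collections import Counter

def token_f1(prediction, reference):
    """Token-level F1 between a predicted answer and a reference answer."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

print(token_f1("Max goes to the dog park every morning",
               "Linda's beagle Max goes to the dog park every morning"))  # ~0.89
```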
3. Chain-of-thought and straightforward fine-tuning are not enough.
We observe an improvement in scores when zero-shot chain-of-thought (CoT) prompting is applied. However, there are still significant score gaps compared to human performance. Our benchmark is not intended for training purposes, but we also fine-tune (FT) Flan-T5 XL on FANToM to see how much performance it gains. Although the fine-tuned model shows a significant improvement on individual question types, it still does not exhibit coherent ToM reasoning.
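For context, here is a minimal sketch of the zero-shot CoT setup referenced above; `generate` is a placeholder for any LLM completion call, and the exact prompts used in the paper may differ.

```python
def zero_shot_cot(generate, conversation, question):
    """Two-stage zero-shot chain-of-thought prompting (Kojima et al., 2022 style)."""
    base = f"{conversation}\n\nQuestion: {question}\n"
    # Stage 1: elicit free-form reasoning with the standard trigger phrase.
    reasoning = generate(base + "Let's think step by step.")
    # Stage 2: ask for the final answer conditioned on the generated reasoning.
    return generate(base + reasoning + "\nTherefore, the answer is:")
```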
4. Even the errors they make are inconsistent.
We analyze the error types of AnswerabilityQ and InfoAccessQ for each model, with and without chain-of-thought (CoT). (1) For list-type questions, models make more errors by including characters who are unaware of the information (i.e., false positives) than by excluding characters who are aware (i.e., false negatives). (2) For binary questions, models exhibit false negative responses more frequently than they do for list-type questions. An interesting observation is that CoT primarily helps the model reduce false positive error rates, but it does not do the same for false negative error rates for either list- or binary-type questions.
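As a simplified sketch of this error breakdown for list-type questions (the data format and rate definitions are assumptions, not the paper's exact analysis code):

```python
def list_question_errors(predicted, gold_aware, all_characters):
    """False positives: unaware characters the model includes.
    False negatives: aware characters the model leaves out."""
    predicted, gold_aware = set(predicted), set(gold_aware)
    unaware = set(all_characters) - gold_aware
    false_positives = predicted - gold_aware
    false_negatives = gold_aware - predicted
    return {
        "false_positive_rate": len(false_positives) / max(len(unaware), 1),
        "false_negative_rate": len(false_negatives) / max(len(gold_aware), 1),
    }

print(list_question_errors(
    predicted={"Linda", "David", "Kailey"},   # model wrongly includes Kailey
    gold_aware={"Linda", "David"},
    all_characters={"Linda", "David", "Kailey", "Sam"},
))
# -> {'false_positive_rate': 0.5, 'false_negative_rate': 0.0}
```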
Conclusion
Although there have been recent debates around the ToM capabilities of LLMs, our results indicate that this capacity has not yet emerged in any manner. With the increasing deployment of LLMs in interactive settings with users, we at AI2 believe it is essential to demystify exaggerated claims regarding the capabilities of current LLMs and make this information accessible to the public.
Please check out our resources if you're interested in more details:
- Our paper on Semantic Scholar
- Code and dataset: https://github.com/skywalker023/fantom