
Applying theory of mind: can AI understand and predict human behavior?

Yuling Gu / October 24, 2024

"Theory of Mind" (ToM) is the ability to understand that others have their own thoughts and beliefs, even when they differ from ours - a skill that underpins human interaction. But ToM isn't just about knowing others' mental states (explicit ToM); it's about applying that knowledge in real-world situations, like interpreting how thoughts drive actions, decisions, and even misunderstandings (applied ToM). But can current models in artificial intelligence (AI) do this too? Find out about this in our recent paper, SimpleToM!

From classical psychology to AI: Can AI truly understand what's on our minds?

In psychology, ToM has been extensively studied in a range of scenarios, such as studies of manipulation, secrecy, deception, lying, and misleading behavior. Examples of classical tests in developmental psychology include the unexpected transfer false belief task, e.g., the Sally-Anne task (Baron-Cohen et al., 1985), and the unexpected contents false belief task, e.g., the Smarties task (Perner et al., 1987). For instance, in the "Sally-Anne" task, Sally places a marble in a box, but it's moved to a jar when she is not around. The task then tests the understanding that Sally would still believe the marble is in the box - a false belief, since her belief no longer matches reality.

Figure 1: A visual illustration of the classic Sally-Anne task.

While previous research has examined whether large language models (LLMs) can understand mental states, there has been little work testing whether they can implicitly apply such knowledge to predict downstream behavior, or to judge whether an observed behavior is rational. As LLMs become more integrated into human interactions and as decision-making agents within complex, human-centered environments, it is crucial to perform in-depth evaluations of their ToM capabilities in everyday situations.

SimpleToM: a dataset designed to test both explicit and applied ToM

To tackle this, we developed SimpleToM, a dataset designed to test both explicit and applied ToM. It features short, diverse stories in relatable everyday situations like, "The can of Pringles has moldy chips in it. Mary picks up the can in the supermarket and walks to the cashier." Each story is accompanied by three questions that probe different levels of ToM reasoning:

  1. Mental state ("Is Mary aware of the mold?")
  2. Behavior ("Will Mary pay for the chips or report the mold?")
  3. Judgment ("Mary paid for the chips. Was that reasonable?")
Figure 2: To allow for a nuanced analysis of models' neural ToM abilities, SimpleToM covers both explicit ToM (a) and applied ToM (b, c) question types.

SimpleToM is unique because it is the first dataset designed to test both explicit (directly querying for information about "mental state", i.e., information awareness) and applied ToM ("behavior" and "judgment" questions) using a large collection of diverse, concise, and straightforward stories.
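For readers who want to browse the data themselves, here is a minimal sketch of how one might load SimpleToM from Hugging Face and print a few examples. The split name and field names (e.g., `story`, `question_type`, `choices`) are illustrative assumptions; see the dataset card for the actual schema.

```python
# Minimal sketch for browsing SimpleToM with the Hugging Face `datasets` library.
# The split name ("test") and field names are assumptions for illustration;
# consult the dataset card for the real schema.
from datasets import load_dataset

dataset = load_dataset("allenai/SimpleToM")

# Print a story together with one of its ToM questions and answer choices.
for example in dataset["test"].select(range(3)):
    print("Story:   ", example["story"])
    print("Type:    ", example["question_type"])   # mental state / behavior / judgment
    print("Question:", example["question"])
    print("Choices: ", example["choices"])
    print()
```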

SimpleToM addresses limitations in previous efforts to examine ToM reasoning in LLMs by (1) having diverse false-belief setups (going beyond those in the Sally-Anne task, where an object is moved while a character is absent, or where a character does or does not notice an external event), (2) requiring LLMs to make commonsense inferences about relevant percepts or beliefs in each situation rather than relying on explicit perception and mentalizing verbs like "sees" and "thinks", and (3) going beyond explicit ToM to test models' ability to implicitly apply inferred knowledge in follow-up applied ToM questions (such as behavior prediction and judgment of behavior).

Exposing the gap in explicit ToM inference and implicit ToM application in frontier models

This new dataset provides new opportunities to assess and refine the ToM capabilities of large language models. Our analysis highlights a significant disparity between their explicit and applied ToM abilities, revealing a striking limitation in current advanced LLMs.

Table 1: Evaluation results for SimpleToM on the different question types. Models are generally proficient in explicit ToM questions (directly querying about "mental state", i.e., information awareness) but this success does not transfer to applied ToM ("behavior" and "judgment" questions).

The experimental results are intriguing: while most models can reliably predict mental state on our dataset (see "mental state" column in the table above), they often fail to correctly predict the behavior (see "behavior" column) and fare even worse at judging whether given behaviors are reasonable (see "judgment" column), even though correctly identifying the characters' mental states should make such secondary predictions obvious. The recent o1 models are different in that they use extra "reasoning tokens" as part of their output. In fact, even for these simple questions on 2-sentence stories, the o1-preview model uses an average of 486 tokens for mental state prediction, 536 for behavior, and 605 for judgment. Despite these additional "reasoning tokens," it still scores only an average of 59.5% on judgment questions across scenarios. This suggests that even though current frontier models like GPT-4o and o1 are proficient on explicit ToM questions (directly querying about "mental state", i.e., information awareness), this success does not transfer to applied ToM ("behavior" and "judgment" questions).

Figure 3: Behavior prediction and judgment accuracy for Llama-3.1-405B, Claude-3.5-Sonnet, GPT-4o, and o1-preview across scenario categories (grocery stores, reusable containers, personal belongings, healthcare, and unethical situations). Mental state accuracy is generally near 100%, while behavior prediction and judgment accuracies are often much lower.
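To make the evaluation setup concrete, the sketch below shows one way to compute per-question-type accuracy over such examples. The `ask_model` stub and the field names (`choices`, `answer_letter`, etc.) are hypothetical stand-ins, not the harness used in the paper.

```python
# Hypothetical evaluation sketch (not the paper's actual harness): score a model
# separately on "mental state", "behavior", and "judgment" questions.
from collections import defaultdict

def ask_model(prompt: str) -> str:
    """Stand-in for an LLM API call; always answers "A" so the sketch runs.
    Replace with a real model call to get meaningful numbers."""
    return "A"

def evaluate(examples) -> dict:
    """Return accuracy per question type; field names are illustrative."""
    correct, total = defaultdict(int), defaultdict(int)
    for ex in examples:
        options = "\n".join(f"({chr(65 + i)}) {c}" for i, c in enumerate(ex["choices"]))
        prompt = f"{ex['story']}\n\n{ex['question']}\n{options}\nAnswer with a single letter."
        prediction = ask_model(prompt).strip()
        total[ex["question_type"]] += 1
        correct[ex["question_type"]] += int(prediction.startswith(ex["answer_letter"]))
    return {qtype: correct[qtype] / total[qtype] for qtype in total}
```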

You can try it on ChatGPT to see how state-of-the-art models like GPT-4o can demonstrate an understanding that someone might be operating with different information. For example, here's a yogurt defect scenario:

However, GPT-4o in this case fails to implicitly apply that understanding when predicting and judging behavior, unless explicitly triggered by an initial awareness question:

Hand-holding strategies to improve applied ToM performance

Interestingly, we found that simple strategies can help LLMs perform better on applied ToM tasks. Techniques like "chain of thought prompting", where the model is guided to articulate its reasoning step-by-step, and reminding models of their own answers to previously answered questions, significantly improved performance.

Table 2: Evaluation with chain-of-thought prompting for two different prompts (CoT and CoT*), showing that the more specific CoT* prompt (guiding the model to consider the awareness of each person) effectively boosts scores on both behavior prediction and judgment of behavior. When combined with mental state (MS) reminder, the scores are high across the board.

The combined approach of using the CoT* chain-of-thought prompt and the mental state (MS) reminder (see table above) raises the behavior prediction accuracies (e.g., from 49.5% to 93.5% for GPT-4o) and the judgment accuracies (e.g., from 15.3% to 94.7% for GPT-4o). In fact, all three models produce high scores across the board for both the behavior and judgment questions. The Claude-3.5-Sonnet model reaches an average score of 97.1% with this method, highlighting the high quality of SimpleToM: with enough reminders and (seemingly obvious) hints, near-perfect scores are achievable.
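To illustrate what these interventions look like in practice, here is a sketch of how a CoT*-style instruction and a mental state (MS) reminder could be prepended to a behavior question. The prompt wording, function name, and example answer are illustrative assumptions, not the exact prompts used in the paper.

```python
# Illustrative prompt construction for a CoT*-style instruction plus a
# mental state (MS) reminder; the wording is an assumption, not the paper's prompt.
COT_STAR_INSTRUCTION = (
    "Think step by step. First consider what each person in the story "
    "is or is not aware of, then answer the question."
)

def build_behavior_prompt(story: str, behavior_question: str,
                          mental_state_answer: str | None = None) -> str:
    parts = [story]
    if mental_state_answer is not None:
        # MS reminder: feed back the model's own earlier mental-state answer.
        parts.append(f"Reminder of your earlier answer: {mental_state_answer}")
    parts.append(COT_STAR_INSTRUCTION)
    parts.append(behavior_question)
    return "\n\n".join(parts)

print(build_behavior_prompt(
    "The can of Pringles has moldy chips in it. Mary picks up the can in the "
    "supermarket and walks to the cashier.",
    "Will Mary pay for the chips or report the mold?",
    mental_state_answer="Mary is not aware that the chips are moldy.",
))
```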

The bigger picture: a cautionary tale for LLM deployment

Using SimpleToM, we reveal a jarring gap between explicit and applied ToM capabilities in current frontier LLMs. If the AI community's goal is building LLM agents capable of applying ToM in complex, human-centered environments, we need to look beyond testing LLMs with psychology-inspired ToM questions, and also start testing them more rigorously on applied ToM (e.g., behavioral prediction and judgment) in different situations. More broadly, we highlight a striking gap between demonstrating conceptual knowledge of something compared to utilizing it, calling for testing LLMs more rigorously on applied abilities in different situations (e.g., doing well on popular commonsense reasoning benchmarks vs. applying commonsense in tasks under real-world context).

SimpleToM is a straightforward test of ToM, and current frontier models uniformly attain high scores with well-designed interventions at inference time. However, we argue that a robust LLM should perform well on SimpleToM without such interventions, so it can independently and flexibly apply ToM-related reasoning whenever required within potentially complex and multi-faceted environments. Model developers interested in real-world deployment should therefore work to close this performance gap, so that models can interact with society appropriately - ideally without constant guidance, the higher inference costs of explicit chain-of-thought reasoning, or implicit o1-preview reasoning tokens.

Next steps and outlook

Progress still needs to be made towards AI systems that can serve as useful personal assistants capable of anticipating our behavior, or as AI judges adept at independently applying important reasoning skills like ToM to make appropriate judgments.

We encourage other researchers to build on our work, for instance, by using SimpleToM to evaluate models' explicit and applied ToM capabilities, to facilitate efforts on developing models that can independently apply ToM in diverse situations without constant guidance, or to inspire more future work delving into the applied capabilities of LLMs.


To read more, see our paper "SimpleToM: Exposing the Gap between Explicit ToM Inference and Implicit ToM Application in LLMs". We make our dataset and models publicly available at https://huggingface.co/datasets/allenai/SimpleToM.
