Do language models have coherent mental models of everyday things? To investigate this, we propose a benchmark dataset on parts and relations of things, called the ParRoT (Parts and Relations of Things) dataset. We treat the task as that of constructing a “parts mental model” of an everyday thing, and evaluate whether language models can accurately judge each relation statement (p1, rln, p2) as True or False. The dataset contains 11,720 such relations across 100 everyday things, 300 mental models, and 2,191 parts.
Relationships encoded include: spatial orientation (part of, has part, inside, contains, in front of, behind, above, below, surrounds, surrounded by, next to), connectivity (connects), and functional dependency (requires, required by).
Using this dataset, our experiments reveal that even state-of-the-art LMs generally have poor mental models of everyday things: their judgments are both inaccurate and in violation of basic commonsense constraints (e.g., asserting that one part is above another without also asserting the inverse). This provides insight into an aspect of their apparent knowledge and behavior not previously explored.
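One such commonsense constraint is that inverse relations must agree: if part p1 is above p2, then p2 must be below p1. The sketch below is a hypothetical illustration (not the paper's actual evaluation code) of checking a set of (p1, rln, p2) triples for missing inverses, using relation names drawn from the list above; the example parts are invented.

```python
# Hypothetical consistency check over (p1, rln, p2) triples,
# using the inverse relation pairs from the ParRoT relation vocabulary.

# If (a, r, b) holds, then (b, inverse(r), a) must also hold.
INVERSES = {
    "above": "below",
    "in front of": "behind",
    "contains": "inside",
    "surrounds": "surrounded by",
    "has part": "part of",
    "requires": "required by",
}
# Make the mapping symmetric (below -> above, etc.).
INVERSES.update({v: k for k, v in list(INVERSES.items())})


def constraint_violations(triples):
    """Return triples whose required inverse triple is absent."""
    held = set(triples)
    return [
        (p1, rln, p2)
        for (p1, rln, p2) in held
        if rln in INVERSES and (p2, INVERSES[rln], p1) not in held
    ]


# Toy "mental model" of a fan (illustrative parts, not dataset entries):
model = {
    ("blade", "part of", "fan"),
    ("fan", "has part", "blade"),
    ("motor", "inside", "fan"),  # inverse ("fan", "contains", "motor") missing
}
print(constraint_violations(model))  # -> [('motor', 'inside', 'fan')]
```

A model whose True/False judgments yield many such violations holds an internally inconsistent picture of the object, even before comparing against the gold relations.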
Yuling Gu, Bhavana Dalvi Mishra, Peter Clark