Digital Socrates: Evaluating LLMs through Explanation Critiques
Looking for an interpretable explanation evaluation tool that can automatically characterize the explanation capabilities of modern LLMs? Meet Digital Socrates!
Yuling Gu
A better way of evaluating explanations
While large language models (LLMs) can provide explanations along with their answers, the nature and quality of those explanations are still poorly understood.
In response, our goal is to (i) define a detailed way of characterizing the explanation capabilities of modern models, (ii) create a sizeable, human-verified dataset for this task, and (iii) train a high-performing, open-source critique model, Digital Socrates (DS), that can generate such characterizations automatically.
Explanation critiquing: task design
To achieve this, we first formalize the explanation critiquing task.
Given a question, along with a model-generated explanation and answer, the task is to produce a critique of that explanation. The first component of the critique identifies the most significant flaw (if any) in the explanation. Informed by the systematic and disciplined method of Socratic questioning, our critique focuses on flaws along dimensions chosen to cover the different types of Socratic questions.
The critique also contains general and specific suggestions, so that each identified flaw comes with a direction for improvement rather than criticism alone. The general suggestion is a statement that addresses a likely misconception underlying the flaw, without giving away the answer to the question. The specific suggestion is a more targeted guide toward the right reasoning chain for this particular question. Finally, the critique rates the explanation quality on a scale from 0 to 5, where 0 indicates the explanation is very wrong and 5 indicates it is completely correct.
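To make the output of the task concrete, here is a minimal sketch of a critique as a Python dataclass. The field names are illustrative choices for this post, not the exact schema used in the paper.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Critique:
    """Illustrative structure of an explanation critique (field names are
    this post's own; see the paper for the exact output format)."""
    flaw_dimension: Optional[str]       # main flaw category (None if no flaw is found)
    flaw_location: Optional[str]        # where in the explanation the flaw occurs
    general_suggestion: Optional[str]   # addresses the likely misconception, without revealing the answer
    specific_suggestion: Optional[str]  # targeted guidance toward the right reasoning chain
    score: int                          # 0 (very wrong) to 5 (completely correct)
```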
Digital Socrates' Critique Bank (DSCB)
We then introduce Digital Socrates' Critique Bank, a sizeable, human-verified dataset for this task. Each instance comprises a question, a model-generated explanation and answer, a critique of that explanation, and any human annotations collected on that instance.
DS Critique Bank focuses on questions requiring reasoning, in particular science and commonsense reasoning. The explanations come from a range of student models and are written in popular explanation styles. We elicit seed explanation-critique data from GPT-4, then perform expert and crowdsourced annotations.
To the best of our knowledge, this is the first dataset of its kind for explanation critiquing, covering nuanced, interpretable (user-comprehensible) critiques of different models' explanations across different explanation styles.
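For readers who want to browse the data, an instance can be loaded and inspected roughly as follows. The dataset identifier below is a placeholder, so check our Hugging Face page (linked at the end of this post) for the exact name and splits.

```python
# Rough sketch of browsing DS Critique Bank with the Hugging Face `datasets`
# library. "allenai/DS-Critique-Bank" is a placeholder identifier; see the
# Hugging Face page linked at the end of this post for the exact name and splits.
from datasets import load_dataset

dscb = load_dataset("allenai/DS-Critique-Bank", split="train")

# Each instance includes a question, a student model's answer and explanation,
# a critique of that explanation, and any human annotations for that instance.
example = dscb[0]
for field, value in example.items():
    print(f"{field}: {value}")
```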
Our model: Digital Socrates (DS)
Using Digital Socrates' Critique Bank, we train open-source, automatic critique models (called Digital Socrates). We fine-tune two critique models, DS-7B and DS-13B, starting from Llama2-7B-chat and Llama2-13B-chat, respectively.
First, we pre-fine-tune on a set of about 50k training questions from ARC and RAINBOW (αNLI, CosmosQA, HellaSwag, Physical IQa, Social IQa and WinoGrande) using a simple zero-shot question-answering format. We then further fine-tune on a curriculum of increasing critique quality, starting with silver data from GPT-4, followed by crowdsourced data, and finally expert-annotated data.
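The sketch below illustrates this staged recipe using the Hugging Face transformers Trainer. It is not our actual training code: the hyperparameters, prompt formats, and toy data are placeholders standing in for the setup described in the paper.

```python
# Illustrative sketch of the staged fine-tuning recipe (not the actual training
# code; hyperparameters, prompts, and data below are toy placeholders).
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer, Trainer,
                          TrainingArguments)

BASE_MODEL = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
tokenizer.pad_token = tokenizer.eos_token  # Llama 2 tokenizers have no pad token
model = AutoModelForCausalLM.from_pretrained(BASE_MODEL)

def make_dataset(pairs):
    """Turn (prompt, target) pairs into a padded causal-LM training set."""
    enc = tokenizer([f"{prompt} {target}" for prompt, target in pairs],
                    truncation=True, max_length=1024, padding=True)
    enc["labels"] = [list(ids) for ids in enc["input_ids"]]
    return Dataset.from_dict(dict(enc))

def fine_tune(model, dataset, output_dir):
    """Run one fine-tuning stage over a single dataset."""
    args = TrainingArguments(output_dir=output_dir, num_train_epochs=1,
                             per_device_train_batch_size=4, learning_rate=2e-5)
    Trainer(model=model, args=args, train_dataset=dataset).train()
    return model

# Stage 1: zero-shot QA pre-fine-tuning (ARC + RAINBOW questions), followed by
# a curriculum of critique data of increasing quality (GPT-4 silver data, then
# crowdsourced data, then expert-annotated data). Toy examples shown here.
stages = [
    ("qa",     [("Question: ...\nAnswer:", "...")]),
    ("silver", [("Critique the explanation: ...", "...")]),
    ("crowd",  [("Critique the explanation: ...", "...")]),
    ("expert", [("Critique the explanation: ...", "...")]),
]
for name, pairs in stages:
    model = fine_tune(model, make_dataset(pairs), output_dir=f"ds-7b-{name}")
```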
The impact
Our explanation critiquing task allows for insightful analysis of student models, looking beyond accuracy.
Even when a model gets the answer correct, its reasoning chain can contain flaws of varying severity. Conversely, when a model's answer is incorrect, its explanation may still make some valid points. Further, explanation critiquing allows us to obtain interpretable insights into the different categories of errors in models' reasoning chains.
For instance, in one case study comparing models on science datasets, looking at the dimensions of explanation flaws tells us that incorrect information is a frequent flaw for Llama2-70B. Localizing the flaws further tells us which topics the model holds incorrect information about. The task also provides suggestions for correcting each flaw, such as supplying the correct information. The general feedback could then point to directions for model improvement or serve as a useful retrieval corpus, while the specific feedback helps correct the reasoning for each individual instance.
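Concretely, once DS critiques have been collected for a student model, an error profile can be built in a few lines. The fields here follow the illustrative critique schema sketched earlier, not necessarily the released format.

```python
# Build a simple error profile from a list of critiques. Field names follow the
# illustrative schema sketched earlier, not necessarily the released format.
from collections import Counter

critiques = [
    {"flaw_dimension": "incorrect information", "score": 2},
    {"flaw_dimension": "incorrect information", "score": 1},
    {"flaw_dimension": None,                    "score": 5},  # no flaw identified
]

flaw_counts = Counter(c["flaw_dimension"] for c in critiques
                      if c["flaw_dimension"] is not None)
mean_score = sum(c["score"] for c in critiques) / len(critiques)

print("Most common flaw dimensions:", flaw_counts.most_common())
print(f"Mean explanation score: {mean_score:.2f}")
```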
The next steps and outlook
In this work, we show how Digital Socrates can, for the first time, provide rich analyses and insights across a range of student models and datasets, without relying on unfaithful proxies, expensive API calls or human annotation.
Our paper provides further analysis showing that applying DS to all datasets in DS Critique Bank, across student models, reveals a rich diversity of behavior.
This fills an important gap in evaluation tools for understanding and improving the explanation behavior of models. We encourage other researchers to build on our work: use DS to evaluate model explanations, in addition to accuracy, on leaderboards and in evaluation code bases; explore applying existing DS models to other reasoning types; and develop future generations of DS models.
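As a starting point, generating a critique with a released DS model looks roughly like the sketch below. The model identifier and prompt format are placeholders, so please consult the model card on Hugging Face for the exact repository name and expected input format.

```python
# Rough sketch of critiquing a student explanation with a DS model. The model
# identifier and prompt format are placeholders; see the model card on
# Hugging Face for the exact repository name and expected input format.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "allenai/digital-socrates-7b"  # placeholder identifier
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID)

prompt = (
    "[Question] Which property of a mineral can be determined with a scratch test?\n"
    "[Student answer] (A) hardness\n"
    "[Student explanation] A scratch test shows how shiny a mineral is, which "
    "tells us its luster, so hardness is the right answer.\n"
    "[Critique]"
)
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=256)
critique = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:],
                            skip_special_tokens=True)
print(critique)
```

In this toy example the student's answer is correct but the reasoning conflates luster with hardness, which is exactly the kind of flaw a critique should surface.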
To read more, see our paper "Digital Socrates: Evaluating LLMs through Explanation Critiques."
We make our dataset and models publicly available on Hugging Face.
You can also watch a presentation of the paper on YouTube.