Contextualized Evaluations: Judging Language Model Responses to Underspecified Queries

July 22, 2025

Chaitanya Malaviya, Joseph Chee Chang, Dan Roth, Mohit Iyyer, Mark Yatskar, Kyle Lo - Ai2


When we ask a language model a question, we often leave out important context. A query like "Is coffee good for you?" seems straightforward, but a quality response depends on hidden context about the user (e.g., do they have high blood pressure? Are they pregnant?). Similarly, a query like "Tell me about transformers" could be asking about the neural network architecture, the movie franchise, or the electronic device.

Current methods for evaluating language models often present these "underspecified" queries to evaluators (whether human or LLM-as-judge) without any of this context. This makes the evaluation an ill-posed task, forcing evaluators to make arbitrary judgments that can lead to inconsistent and unreliable conclusions about model quality.

In our recent TACL paper, we introduce Contextualized Evaluations, a simple protocol to make language model evaluations more reliable and insightful by synthetically creating and including context.

The Problem: Evaluating Responses to Underspecified Queries is an Ill-Defined Task

Without knowing the user's attributes (like their background or expertise), their intent, or their criteria for a useful response, it's impossible for an evaluator to definitively choose the "best" answer to an underspecified query. An explanation of transformers that would be perfect for an NLP researcher might be less preferred for a student studying electrical engineering. Without clear evaluation criteria, evaluators may resort to judging responses based on surface-level qualities like style or formatting, rather than on how well the response actually meets the user's needs.

Just How Underspecified Are User Queries?

To understand the scale of this problem, we analyzed 3,580 queries randomly sampled from five widely used language model benchmarks, including Chatbot Arena and MT-Bench. We found that underspecification is widespread: the vast majority of queries in these datasets are open-ended (76%), while many are also subjective (19%) or incomplete (18%). In other words, current evaluation leaderboards rely heavily on underspecified queries.

Our Solution: Adding Context with Follow-up Questions

Our solution is to provide this missing context to evaluators. We represent context as a set of follow-up question-and-answer (QA) pairs, simulating an interactive scenario where a model could ask for clarification before responding. For a query like "I am making pumpkin pie, can you help me?", the context might include:

  • Q: Do you have any dietary restrictions?
    A: [“None”, “Gluten-free”, “Dairy-free”, “Low sugar”]
  • Q: How many servings are you planning to make?
    A: [“Small (3-6 servings)”, “Medium (8-10 servings)”, “Large (12+ servings)”]

We use large language models to synthetically generate these contexts. Even with simple few-shot prompts, models are highly effective at creating relevant follow-up questions and answers. Human validation confirmed that the generated contexts are high quality: 76% of questions were deemed important, and answer sets were found to be realistic (90%), complete (75%), and diverse (80%), meaning the choices would lead to meaningfully different responses.
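
To make this concrete, here is a minimal sketch of how such contexts could be generated. The prompt wording and the `call_llm` helper are illustrative assumptions, not the exact setup from the paper (see the released code linked below for that).

```python
import json

def call_llm(prompt: str) -> str:
    """Hypothetical helper: send a prompt to your preferred LLM API and return its text output."""
    raise NotImplementedError("plug in an LLM client here")

# Illustrative few-shot prompt (not the paper's exact prompt).
CONTEXT_PROMPT = """You are given a user query that may be underspecified.
Write 2-4 follow-up questions that would clarify the user's intent, each with a short
list of plausible answer choices. Return JSON: [{"question": ..., "answers": [...]}]

Query: I am making pumpkin pie, can you help me?
Follow-ups: [{"question": "Do you have any dietary restrictions?",
              "answers": ["None", "Gluten-free", "Dairy-free", "Low sugar"]},
             {"question": "How many servings are you planning to make?",
              "answers": ["Small (3-6 servings)", "Medium (8-10 servings)", "Large (12+ servings)"]}]

Query: """

def generate_context(query: str) -> list[dict]:
    """Generate follow-up questions with candidate answer sets for a (possibly underspecified) query."""
    raw = call_llm(CONTEXT_PROMPT + query + "\nFollow-ups:")
    return json.loads(raw)  # [{"question": str, "answers": [str, ...]}, ...]
```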

Putting Context to the Test

We designed three evaluation settings to measure the impact of context (see the code sketch after the list):

  1. Standard Evaluation (NoCtxGen-NoCtxEval): The typical approach where neither the model generating the response nor the evaluator sees any context.
  2. Implicit Context Discovery (NoCtxGen-CtxEval): Models generate "default" responses without context, but the evaluator is given context. This setting helps us discover the implicit assumptions and biases in a model's default behavior.
  3. Adaptive Evaluation (CtxGen-CtxEval): Both the model and the evaluator are given the same context. This setting tests a model's ability to adapt its response to specific user needs.
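
Here is a minimal sketch of how the inputs differ across the three settings, assuming one answer has already been chosen from each follow-up question's answer set and using illustrative prompt formats rather than the paper's exact templates.

```python
def format_context(context: list[dict]) -> str:
    """Render chosen QA pairs, e.g. [{"question": ..., "answer": ...}, ...], as plain text."""
    return "\n".join(f"Q: {c['question']}\nA: {c['answer']}" for c in context)

def build_inputs(query: str, context: list[dict], setting: str) -> tuple[str, str]:
    """Return (generator_prompt, evaluator_context) for one of the three settings."""
    ctx_text = format_context(context)
    if setting == "NoCtxGen-NoCtxEval":   # 1. standard: nobody sees the context
        return query, ""
    if setting == "NoCtxGen-CtxEval":     # 2. implicit context discovery: only the evaluator sees it
        return query, ctx_text
    if setting == "CtxGen-CtxEval":       # 3. adaptive: both generator and evaluator see it
        return f"{query}\n\nContext about the user:\n{ctx_text}", ctx_text
    raise ValueError(f"unknown setting: {setting}")
```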

We ran pairwise evaluations for three different model pairs (such as GPT-4o vs. Gemini-1.5-Flash) across 1,881 queries, collecting judgments from both human evaluators and LLM judges (autoraters).
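
A pairwise LLM-judge call in this setup might look roughly like the following; the judge prompt is again an illustrative assumption, and it reuses the hypothetical `call_llm` helper sketched above.

```python
JUDGE_PROMPT = """You are comparing two responses to a user query.
{context_block}Query: {query}

Response A:
{response_a}

Response B:
{response_b}

Which response better satisfies the user's needs? Answer "A", "B", or "Tie"."""

def judge_pair(query: str, response_a: str, response_b: str, evaluator_context: str = "") -> str:
    """Ask an LLM judge for a pairwise preference, with or without user context."""
    context_block = (
        f"Context about the user (follow-up questions and answers):\n{evaluator_context}\n\n"
        if evaluator_context else ""
    )
    prompt = JUDGE_PROMPT.format(context_block=context_block, query=query,
                                 response_a=response_a, response_b=response_b)
    return call_llm(prompt).strip()  # "A", "B", or "Tie"
```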

Main Findings: Context Changes Evaluation Conclusions

Finding #1: Evaluator agreement increases and rankings can flip

When context is provided, evaluators have a clearer, more objective basis for their judgments. This significantly improves agreement among both human and AI evaluators, by 3-10 absolute percentage points.

Notably, providing context can substantially alter model win rates and even flip the rankings between models. In our experiments, one model beat another in the standard, context-agnostic setting (NoCtxGen-NoCtxEval), but its competitor became the decisive winner when both models were asked to adapt to specific contexts (CtxGen-CtxEval). This suggests that current leaderboards may not reflect how well models perform when queries are more specific or require personalization.
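
As a rough illustration of how win rates and evaluator agreement can be computed from raw pairwise verdicts (a simplified formulation of our own, not necessarily the exact metrics reported in the paper):

```python
from collections import Counter
from itertools import combinations

def win_rate(verdicts: list[str], model: str = "A") -> float:
    """Fraction of pairwise verdicts won by `model`, counting ties as half a win."""
    counts = Counter(verdicts)              # each verdict is "A", "B", or "Tie"
    return (counts[model] + 0.5 * counts["Tie"]) / len(verdicts)

def mean_pairwise_agreement(verdicts_per_query: list[list[str]]) -> float:
    """Average fraction of evaluator pairs giving the same verdict on each query."""
    rates = []
    for verdicts in verdicts_per_query:     # e.g. ["A", "A", "B"] from three evaluators
        pairs = list(combinations(verdicts, 2))
        rates.append(sum(a == b for a, b in pairs) / len(pairs))
    return sum(rates) / len(rates)
```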

Finding #2: More substantive judgments based on content rather than style

Context also changes what evaluators prioritize. We analyzed the free-text justifications for over 2,700 judgments and found that providing context nudges evaluators to focus less on surface-level criteria (like style, tone, and conciseness) and more on content-level criteria (like relevance, correctness, and adherence to the user's needs).
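
This analysis amounts to labeling each free-text justification by the criteria it invokes. A minimal sketch, with our own illustrative category lists and prompt, again reusing the hypothetical `call_llm` helper:

```python
SURFACE = ["style", "tone", "conciseness", "formatting"]
CONTENT = ["relevance", "correctness", "adherence to the user's needs"]

def classify_justification(justification: str) -> str:
    """Label a judge's free-text justification as 'surface'- or 'content'-focused."""
    prompt = (
        "Does the following evaluation justification focus mostly on surface-level criteria "
        f"({', '.join(SURFACE)}) or content-level criteria ({', '.join(CONTENT)})? "
        "Answer with exactly one word: 'surface' or 'content'.\n\n"
        f"Justification: {justification}"
    )
    return call_llm(prompt).strip().lower()
```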

Finding #3: Default model responses show a WEIRD bias

Using our "Implicit Context Discovery" setup, we can investigate the question: Who are models built for by default? Our findings suggest a potential bias in default model responses toward WEIRD (Western, Educated, Industrialized, Rich, and Democratic) contexts. For example, when generating responses without explicit instructions, models tend to produce answers better aligned with:

  • Users from Western cultural contexts.
  • High-income individuals.
  • Users who are young or middle-aged adults.
  • People with a basic understanding of a topic, rather than experts.

This analysis reveals disparities in how well default responses serve different user populations, an insight that is missed in standard evaluations.
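
One way to operationalize this analysis, sketched under our own assumptions about the data layout: in the NoCtxGen-CtxEval setting, judge a fixed pair of default responses under contexts whose answers encode different user attributes, then break the preference rates down by attribute value.

```python
from collections import defaultdict

def preference_by_attribute(records: list[dict]) -> dict[str, float]:
    """Break pairwise verdicts on default responses down by a context attribute.

    Each record is assumed to look like {"attribute_value": "expert", "verdict": "A"},
    where the verdict comes from judging the same pair of default responses under a
    context reflecting that attribute value (e.g., expertise level or cultural background).
    """
    grouped = defaultdict(list)
    for record in records:
        grouped[record["attribute_value"]].append(record["verdict"])
    # Reuses win_rate() from the earlier sketch.
    return {value: win_rate(verdicts, model="A") for value, verdicts in grouped.items()}
```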

Why This Matters for Using and Evaluating LLMs

Our work demonstrates that context-agnostic evaluations, the current standard in the field, can produce unreliable conclusions and overlook critical aspects of model behavior, like adaptability and implicit bias.

Contextualized evaluations provide a simple, "plug-and-play" recipe for supplementing existing benchmarks with plausible user contexts. Through context, we can arrive at more consistent and reliable conclusions about model performance. We encourage researchers and developers to adopt contextualized evaluations for a more holistic understanding of how well models serve the diverse needs of all users.

Resources

Paper: https://arxiv.org/abs/2411.07237

Code: https://github.com/allenai/ContextEval

Data: https://huggingface.co/datasets/allenai/ContextEval
