Why Artificial Analysis uses Ai2's IFBench instruction-following eval

May 11, 2026

Ai2


A helpful AI model does more than give a plausible answer—it follows instructions exactly. That sounds simple, but it isn't. A user might ask for a three-sentence summary, a rewrite in a casual tone, or an answer that includes one word and avoids another—often all at once. A model can understand the topic and still get it wrong.

Accepted to NeurIPS 2025, our IFBench benchmark tests how well language models follow precise natural-language instructions. Last year, Artificial Analysis, an independent AI benchmarking organization, added IFBench to its Intelligence Index, which combines several evaluations into a single score to measure a model’s overall capability.

"We saw that a model's ability to follow user instructions was something developers cared a lot about, so we wanted to assess it explicitly,” says Declan Jackson, a member of the technical staff at Artificial Analysis. “IFBench was designed to fill that gap and was challenging even for the frontier models."

Measuring adherence to a prompt

IFBench doesn't just test basic instruction following—it forces a model to adhere to several rules in a single response. Some are easy, like minimum word counts or required keywords. Others are trickier: sentences that have to match in length, consecutive words that can't start with the same letter, or a keyword that has to land in an exact spot.

"That is different from many other benchmarks, where instruction following is captured only indirectly through output templates or requested answer structures," says Jackson.

Each constraint might seem arbitrary on its own, but together they reflect a familiar situation: users often ask a model for several things at once, and missing even one can ruin the answer. To ground IFBench in real-world use, its prompts are pulled from real user conversations—not written from scratch by researchers.

"IFBench measures instruction following in a way that feels closer to real-world use than earlier instruction following evals,” says Jackson. “The prompts use casual, user-like language, cover a wide range of tones and lengths rather than following a fixed template, and focus on common tasks such as factual question answering, content review and summarization, and creative support. IFBench’s wider coverage also makes it … a stronger overall signal of instruction-following ability."

What IFBench shows that other benchmarks miss

AI benchmarks usually have a short shelf life. Once models start scoring near the top, the evals stop being useful for telling systems apart. Most evaluations added to Artificial Analysis's Intelligence Index saturate within about six months, according to Jackson.

But IFBench hasn't.

"While IFBench scores have improved over time, that progress has not been uniform across models, and new frontier models still do not always perform well on it,” says Jackson.

There are a couple of reasons for this.

The first is that complex instruction following doesn't have much overlap with the capabilities most labs are actively training for, says Jackson. Coding and tool use get heavy post-training investment because gains there tend to generalize across other tasks and benchmarks. Instruction following is narrower, and it rarely improves as a byproduct of progress in those areas.

The second reason is the sheer breadth of what IFBench measures. Its wide set of constraints and prompts means progress has been slower relative to more targeted domain or capability evaluations, which labs can chip away at with focused post-training recipes.

This shows up in the numbers. IFBench scores cluster sharply by model family, says Jackson, and the rankings don't line up with Artificial Analysis's broader Intelligence Index.

xAI still leads IFBench, with Grok 4.20 (0309, Reasoning) taking the top spot at 82.9% and Grok 4.3 close behind at 81.3%. Recent Google models also score well: Gemini 3 Flash Preview (Reasoning) reaches 78.0%, while Gemini 3.1 Flash-Lite Preview and Gemini 3.1 Pro Preview land at 77.2% and 77.1%, respectively. OpenAI’s GPT-5.5 (xhigh) and GPT-5.4 (xhigh) follow at 75.9% and 73.9%. The leading Claude models cluster lower on IFBench, with Claude Opus 4.7, Claude Sonnet 4.6, and Claude 4.5 Haiku scoring between 54.3% and 58.6%—even though Claude Opus 4.7 ranks near the top of the Intelligence Index at 57 points, behind GPT-5.5 (xhigh) at 60 and effectively tied with Gemini 3.1 Pro Preview and GPT-5.4 (xhigh)—which also score 57.

A truly open approach to evals

IFBench is useful to Artificial Analysis for two reasons: what it measures, and the fact that we released it openly. 

Openness lets Jackson's team implement the evaluation faithfully and run it across a wide range of models, feeding the leaderboards their users rely on. It also makes the benchmark itself easier to understand, since anyone can see what's being measured and why.

For Artificial Analysis, IFBench tests something that comes up in nearly every AI interaction: whether a model can keep track of what a user is asking, especially when the request has a lot going on. It's a regular part of Artificial Analysis's evaluations now, and a strong example of what Ai2's open benchmarks bring to the field.

"Beyond evaluations, Ai2 is an important leader in open source," says Jackson. "Their work not only helps advance the industry through open research, but also gives users access to research artifacts with transparency around data and methodology.”
