
How2Everything: Mining the web to evaluate and improve LLMs on real-world procedures

February 10, 2026

Ai2


People ask chatbots how to do things all the time—how to fix a leaky faucet, how to file taxes, how to negotiate a raise, and so on. One estimate suggests around 8.5% of ChatGPT conversations are requests for this kind of step-by-step guidance. And as AI systems take on more complex tasks, generating reliable instructions becomes even more important.

But here's the problem: how do you measure whether an AI's instructions would work? You can't run the procedure inside a test—no benchmark is going to have someone file for divorce or rewire an electrical panel just to verify the steps. And surface-level comparisons don't catch the errors that matter most, like a missing prerequisite or steps in the wrong order that would cause the whole thing to fail.

That's why today we're releasing How2Everything, a framework for evaluating and improving how well AI models generate step-by-step procedures. It includes a pipeline for extracting real-world procedures from the web (351K examples from nearly a million pages across 14 topics), a 7,000-example benchmark for testing models, and an open evaluation model that checks whether a procedure contains any critical failure that would prevent someone from achieving their goal.

Our evaluations show that training models against this signal (rewarding procedures with fewer critical failures) improves their performance by more than 10 points on our How2Bench benchmark without degrading their capabilities elsewhere.

More broadly, this work offers a worked example of how pretraining web data can support a closed loop of capability evaluation and improvement at scale. The web provides a virtually unbounded supply of open-ended, naturally occurring real-world documents that can serve as reference anchors when execution-based verification is infeasible. By mining and standardizing this data into an evaluable format, and by developing an evaluation protocol that targets task-level validity and can be made reliable and reproducible at scale, we turn an otherwise hard-to-measure behavior into a practical development loop.

Why procedures are a missing benchmark primitive

Procedures matter everywhere—for example, in agentic systems, planning and tool use hinge on producing a correct sequence of actions. Yet existing datasets are often constrained by domain, by source, or by metrics that don't reflect whether a procedure would actually succeed. How2Everything is designed to be broad, scalable, and focused on real-world validity.

How2Everything has three main components: How2Mine, a pipeline for extracting procedures from the web; How2Bench, a benchmark for evaluating models; and How2Score, an evaluation method and open judge model called How2Judge. We also release training data and recipes for improving models directly.

How2Mine

How2Mine is a pipeline for extracting and standardizing procedures from web pages at scale. It starts from the DCLM web corpus, uses WebOrganizer to identify tutorial-style pages, then applies stratified sampling to ensure diversity across 14 topics—everything from art & design and food & dining to crime & law, electronics & hardware, and transportation.
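To make the stratification step concrete, here is a minimal sketch of per-topic sampling, assuming each page already carries a WebOrganizer-style topic label. The field names and per-topic budget below are our own illustrative choices, not the released pipeline:

```python
import random
from collections import defaultdict

def stratified_sample(pages, per_topic=1000, seed=0):
    """Group tutorial-style pages by topic label and take an equal
    number from each topic (illustrative, not the released code)."""
    by_topic = defaultdict(list)
    for page in pages:
        by_topic[page["topic"]].append(page)

    rng = random.Random(seed)
    sampled = []
    for topic, docs in sorted(by_topic.items()):
        rng.shuffle(docs)
        sampled.extend(docs[:per_topic])
    return sampled
```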

The pipeline then uses GPT-4.1 to process these pages through multiple stages: extracting candidate procedures from the raw HTML; filtering out procedures that are UI-dependent, non-sequential, or nonsensical; applying heuristics (keeping only those with 5–15 steps); extracting resource lists; and running final validation.
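As a rough sketch of what one extracted record looks like and where the step-count heuristic fits, assuming a simple schema of our own (the released format may differ):

```python
from dataclasses import dataclass, field

@dataclass
class Procedure:
    """Illustrative schema for one extracted procedure."""
    topic: str
    goal: str
    resources: list[str] = field(default_factory=list)
    steps: list[str] = field(default_factory=list)

def passes_step_heuristic(proc: Procedure, min_steps: int = 5, max_steps: int = 15) -> bool:
    # Keep only procedures in the 5-15 step range mentioned above; the
    # UI-dependence, sequentiality, and sanity filters are LLM stages
    # and are not reproduced here.
    return min_steps <= len(proc.steps) <= max_steps
```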

Running How2Mine over 980K documents yields 351,162 structured procedures, each with a topic, a goal, a list of required resources, and reference steps. Processing at this scale required 252K API calls at a cost of around $5,700.

Even after filtering, not every reference procedure is perfect. As a quality check, we validated the benchmark references with GPT-4.1, which rated 96.6% of them as valid.

How2Bench

How2Bench is a benchmark for testing how well models generate procedures. It's built by sampling 500 procedures per topic from the How2Mine pool, with the remaining procedures reserved for training.

To evaluate a model, How2Bench provides a goal (e.g., "change a flat tire"), a list of available resources, and the number of steps N the procedure should have. The model must then generate exactly N single-sentence steps. This controlled setup makes results comparable across models.
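A minimal sketch of this evaluation setup, with prompt wording and a step parser that are our own assumptions rather than the released How2Bench prompts:

```python
def build_how2bench_prompt(goal: str, resources: list[str], n_steps: int) -> str:
    """Assemble an evaluation prompt in the spirit described above."""
    resource_list = "\n".join(f"- {r}" for r in resources) or "- (none)"
    return (
        f"Goal: {goal}\n"
        f"Available resources:\n{resource_list}\n\n"
        f"Write a procedure to achieve the goal in exactly {n_steps} steps. "
        f"Each step must be a single sentence, numbered 1 through {n_steps}."
    )

def parse_steps(completion: str, n_steps: int):
    """Return the steps if the model produced exactly n_steps numbered
    lines, otherwise None (a simple, illustrative parser)."""
    lines = [l.strip() for l in completion.splitlines() if l.strip()]
    steps = [l.split(".", 1)[1].strip() for l in lines if l[0].isdigit() and "." in l]
    return steps if len(steps) == n_steps else None
```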

Unlike many benchmarks that saturate quickly in the course of model development, How2Bench shows clear scaling trends across both model size and training progress—making it useful for tracking improvements well before a model reaches frontier performance.

How2Score

How2Score is an evaluation method designed to measure whether a procedure would actually work—not just whether it sounds helpful.

Specifically, How2Score checks whether a procedure contains any critical failure that would prevent someone from achieving their goal. Critical failures include missing steps, unnecessary actions that would derail the process, contradictions, or vagueness severe enough to make the procedure unusable—like skipping a legally required waiting period in a property sale, or leaving out essential cooking temperatures and times.
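In spirit, the check reduces to a single yes/no question posed to a judge model. The prompt below paraphrases the criteria in this post and is not the released How2Score prompt; `judge` stands in for any text-in/text-out model call (How2Judge, GPT-5, and so on):

```python
CRITICAL_FAILURE_PROMPT = """\
You are checking whether a step-by-step procedure would let someone achieve the goal.
A critical failure is a missing step, an unnecessary action that derails the process,
a contradiction, or vagueness severe enough to make the procedure unusable.

Goal: {goal}
Resources: {resources}
Procedure:
{steps}

Does this procedure contain any critical failure? Answer YES or NO, then briefly explain.
"""

def has_critical_failure(judge, goal, resources, steps) -> bool:
    """Ask the judge whether the procedure contains a critical failure."""
    prompt = CRITICAL_FAILURE_PROMPT.format(
        goal=goal,
        resources=", ".join(resources),
        steps="\n".join(f"{i + 1}. {s}" for i, s in enumerate(steps)),
    )
    verdict = judge(prompt)
    return verdict.strip().upper().startswith("YES")
```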

Using a proprietary model like GPT-5 to do this is accurate, but it's expensive at scale and makes results hard for others to reproduce; evaluating 7,000 examples with GPT-5 costs around $15.

To make How2Score practical for widespread use, we distilled the GPT-5 judge into an open model called How2Judge. First, we validated our critical-failure evaluation framework against human annotations (200 examples labeled by three annotators). Then we used GPT-5 to generate 73K judgments and trained an open 8B model based on Qwen 3 to replicate those decisions.
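A sketch of what the distillation data might look like as plain supervised pairs; the field names and file layout here are assumptions, not the released training format:

```python
import json

def write_distillation_data(examples, path="how2judge_sft.jsonl"):
    """Turn teacher (GPT-5) judgments into prompt/completion pairs for
    fine-tuning an open 8B student judge (illustrative layout only)."""
    with open(path, "w") as f:
        for ex in examples:
            record = {
                "prompt": ex["judge_prompt"],         # same prompt the teacher saw
                "completion": ex["teacher_verdict"],  # e.g. "YES: step 3 omits ..."
            }
            f.write(json.dumps(record) + "\n")
```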

The resulting judge agrees with GPT-5 90.5% of the time and matches human majority labels 80.5% of the time—accurate enough to enable low-cost, reproducible evaluation and to serve as a reward signal for training.

Improving models with How2Everything

How2Everything isn't just an evaluation framework—it’s also meant to help improve models. A subset of procedures from How2Mine can serve as training data, while the How2Score judge provides a reward signal: procedures with fewer critical failures earn higher rewards, which is exactly what How2Bench measures.
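A minimal sketch of how that reward signal could be wired up, assuming `example` holds a mined goal and resource list from the training split and `judge_has_critical_failure` wraps a How2Judge call; the actual reward shaping, prompting, and batching will differ:

```python
def reward_fn(judge_has_critical_failure, example, generated_steps) -> float:
    """One RL-style reward call: 1.0 when the judge finds no critical
    failure in the generated procedure, 0.0 otherwise."""
    failed = judge_has_critical_failure(
        goal=example["goal"],
        resources=example["resources"],
        steps=generated_steps,
    )
    return 0.0 if failed else 1.0
```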

Our framework produces substantial gains in generating valid step-by-step procedures as measured by How2Bench. Qwen3-4B-Inst improved from 30.3 to 43.5 (+13.2 points), Qwen3-8B-Inst from 38.5 to 48.6 (+10.1), and Olmo 3 7B Think from 27.3 to 37.9 (+10.6). Importantly, these gains don't come at the cost of other capabilities—results on 12 out-of-domain benchmarks show no systematic degradation.

One important finding: explicit length control matters during training. Without it, models learn to game the judge by producing longer, more verbose outputs. An ablation shows inflated How2Bench scores paired with much longer procedures when length control is removed, a useful reminder that LLM-as-judge setups need careful design.
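One way to add that control is to fold a length check into the reward itself, as in the sketch below. The thresholds and penalty values are illustrative assumptions, not the recipe used in training:

```python
def length_controlled_reward(judge_failed: bool, steps: list[str],
                             n_target: int, max_words_per_step: int = 30) -> float:
    """Combine the judge verdict with explicit length control so the
    policy can't inflate scores by padding its output."""
    if judge_failed:
        return 0.0
    if len(steps) != n_target:          # ignored the requested step count
        return 0.0
    if max(len(s.split()) for s in steps) > max_words_per_step:
        return 0.5                      # valid but verbose: partial credit
    return 1.0
```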

What we're releasing

We're releasing all the code and data associated with How2Everything, including the How2Mine pipeline and prompts, the full 351K procedure dataset and How2Bench split, the distilled How2Score judge (8B), and training recipes for fine-tuning with How2Score as a reward.

If you're building instruction-following systems, tool-using agents, or anything that depends on reliable step-by-step guidance, How2Everything lets you benchmark whether your model's procedures would actually work and train directly for fewer critical failures.

Model | Tech Report | Code | Data
