A digital abstract image, meant to convey a hopeful futuristic feeling.

Open models - Evaluation frameworks

AI practitioners need to know if their models are performant, safe, biased, or hallucinating — and these are just a few of the measurements that can help us compare notes. Here, you’ll find Ai2’s evaluation frameworks and benchmarks, open and accessible so we can compare like-for-like outcomes.

DataDecide

How do LLM developers choose their pretraining data? All AI labs create small-scale models as experiments, but the models and their data are rarely shared. DataDecide opens up the process.

DataDecide is a suite of 1,050 models, 30k checkpoints, 10 benchmarks, and 25 datasets, along with recommendations for the best and most cost-effective benchmarks, prediction methods, and metrics to use to make decisions.

Read the blog Access models, data, and evals

An abstract image of pipes raised to various heights.

Ai2 Safety Toolkit

This suite of resources is focused on advancing LLM safety, which will empower researchers and industry professionals to work together on building safer LLMs. The suite includes WildTeaming, an automatic red-teaming framework for identifying and reproducing human-devised attacks, WildJailbreak, a high-quality, large-scale safety training dataset with 262K training examples, and WildGuard, a light-weight, multi-purpose moderation tool for assessing the safety of user-LLM interactions across three safety moderation tasks.

Access the toolkit

Paloma

Paloma is a benchmark for evaluating open language models across many different domains (ranging from niche artist communities to reddit forums on mental health). We have already evaluated several models to understand how language model performance varies across 585 different domains. We invite you to run our standardized inference code on additional models and submit their results to extend this benchmark!

Start with our ReadMe

An abstract image of digital bars, echoing the shape of clouds or mountains.

OLMES

OLMES is a standard for reproducible language model evaluations that is open, practical, completely documented, and can be applied to current leaderboards and evaluation code bases. OLMES is designed to facilitate robust comparisons of model performances, both during model development and when comparing final powerful models, and can be used across a range of model sizes e.g., from 1B to 70B.

OLMES on GitHub

CoCoNot

In addition to more straightforward safety concerns, AI practitioners should consider the cases when models should not comply with a user’s request. Noncompliance prompts should include incomplete, unsupported, indeterminate, and humanizing requests in addition to unsafe requests. CoCoNot is a dataset that includes a set of queries that should elicit noncompliance either by curating examples from existing datasets or synthetically generating them using GPT models.

CoCoNot on GitHub

ZebraLogic

LLMs excel at information-seeking and creative writing tasks. They have significantly improved in math and coding too. But how do they perform in logical reasoning? ZebraLogic evaluates the logical reasoning abilities of LLMs via logic grid puzzles, which require multiple high-order thinking skills. Results show that LLMs still lack several abilities required for logical reasoning, like analytical thinking, counterfactual thinking, reflective reasoning, structured memorization, and compositional generalization.

Earn your stripes

ConfAIde

As users share more personal information with AI like their personal home assistants, it’s crucial to understand how well those models can protect that sensitive information. The ConfAIde benchmark can be used to identify critical weaknesses in the privacy reasoning capabilities of LLMs.

ConfAIde

WildBench

WildBench is an automated evaluation framework designed to benchmark large language models (LLMs) using challenging, real-world user queries. It builds off of the WildChat dataset, and is an evolution of popular evaluation styles like AlpacaEval.

Visit the leaderboard

RewardBench

RewardBench is the first benchmark for evaluating reward models for RLHF. RewardBench evaluates the chat, instruction following, math, reasoning, and safety abilities of reward models. It has helped initiate a faster moving academic ecosystem studying this part of the alignment process.

Visit the leaderboard

RewardBench 2

A new version of RewardBench, based on unseen human data and designed to be substantially more difficult. RewardBench 2 is built on unseen human prompts from real-world chat interactions - prompts were categorized across six domains and new completions were designed and generated to create a challenging best-of-4, multi-skill benchmark.

RewardBench 2 on Hugging Face

olmOCR-bench

olmOCR-bench is a dataset of 1,403 PDF files, plus 7,010 unit test cases that capture properties of the output that a good OCR system should have. This benchmark evaluates the ability of OCR systems to accurately convert PDF documents to markdown format while preserving critical textual and structural information.

olmOCR-Bench on Hugging Face

DiscoveryBench

The first comprehensive benchmark that formalizes the multi-step process of data-driven discovery.

DiscoveryBench on Hugging Face