Open models - Evaluation frameworks

AI practitioners need to know whether their models are performant, safe, biased, or hallucinating, and these are just a few of the measurements that can help us compare notes. Here you’ll find Ai2’s evaluation frameworks and benchmarks, open and accessible so we can compare like-for-like outcomes.

Featured framework - Ai2 Safety Toolkit

This suite of resources is focused on advancing LLM safety, empowering researchers and industry professionals to work together on building safer LLMs. The suite includes WildTeaming, an automatic red-teaming framework for identifying and reproducing human-devised attacks; WildJailbreak, a high-quality, large-scale safety training dataset with 262K training examples; and WildGuard, a lightweight, multi-purpose moderation tool for assessing the safety of user-LLM interactions across three safety moderation tasks.
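As a rough illustration of the moderation step, the sketch below loads WildGuard as a causal language model from Hugging Face and asks it to judge a single user-LLM exchange. The model ID and the prompt template here are assumptions made for illustration; consult the WildGuard release for the exact input format it expects.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "allenai/wildguard"  # assumed Hugging Face model ID

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto")

# Hypothetical instruction template; the real WildGuard release defines its own format.
TEMPLATE = (
    "### Instruction:\n"
    "Assess the safety of the following exchange.\n\n"
    "Human user: {prompt}\n\nAI assistant: {response}\n\n"
    "### Answer:\n"
)

def moderate(prompt: str, response: str, max_new_tokens: int = 32) -> str:
    """Return the moderator's raw verdict text for one user-LLM interaction."""
    inputs = tokenizer(TEMPLATE.format(prompt=prompt, response=response),
                       return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    return tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)

print(moderate("How do I pick a lock?", "I can't help with that."))
```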

Paloma

Paloma is a benchmark for evaluating open language models across many different domains (ranging from niche artist communities to Reddit forums on mental health). We have already evaluated several models to understand how language model performance varies across 585 different domains. We invite you to run our standardized inference code on additional models and submit their results to extend this benchmark!
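For intuition, here is a minimal sketch of the per-domain perplexity statistic that Paloma aggregates. The model ID and the toy domain snippets are placeholders; official results should be produced with Paloma's own standardized inference code.

```python
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "gpt2"  # placeholder; substitute the model you want to evaluate

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID).eval()

domains = {  # placeholder snippets standing in for Paloma's 585 domains
    "reddit_mental_health": ["I have been feeling anxious lately and ..."],
    "niche_artist_community": ["The glaze on this ceramic piece turned out ..."],
}

@torch.no_grad()
def perplexity(texts):
    """Token-weighted perplexity over a list of documents."""
    total_nll, total_tokens = 0.0, 0
    for text in texts:
        enc = tokenizer(text, return_tensors="pt")
        loss = model(**enc, labels=enc["input_ids"]).loss  # mean NLL per predicted token
        n_pred = enc["input_ids"].shape[1] - 1             # labels are shifted by one
        total_nll += loss.item() * n_pred
        total_tokens += n_pred
    return math.exp(total_nll / total_tokens)

for name, texts in domains.items():
    print(f"{name}: perplexity = {perplexity(texts):.2f}")
```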

OLMES

OLMES is a standard for reproducible language model evaluations that is open, practical, completely documented, and applicable to current leaderboards and evaluation code bases. OLMES is designed to facilitate robust comparisons of model performance, both during model development and when comparing final powerful models, and can be used across a range of model sizes (e.g., from 1B to 70B parameters).
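One of the choices a standard like OLMES has to pin down is how multiple-choice answers are scored. The sketch below shows a common formulation, ranking options by length-normalized log-likelihood of the answer continuation; it illustrates the kind of decision such a standard fixes and is not the OLMES specification itself. The model and question are placeholders.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "gpt2"  # placeholder model for illustration

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID).eval()

@torch.no_grad()
def option_score(context: str, option: str) -> float:
    """Length-normalized log-likelihood of the option tokens given the context."""
    ctx_len = tokenizer(context, return_tensors="pt")["input_ids"].shape[1]
    full_ids = tokenizer(context + option, return_tensors="pt")["input_ids"]
    logits = model(full_ids).logits[0, :-1]        # predictions for each next token
    targets = full_ids[0, 1:]
    token_logprobs = torch.log_softmax(logits, dim=-1)[torch.arange(len(targets)), targets]
    option_logprobs = token_logprobs[ctx_len - 1:]  # keep only the option positions
    return option_logprobs.sum().item() / len(option_logprobs)

question = "Question: What gas do plants absorb for photosynthesis?\nAnswer:"
options = [" carbon dioxide", " oxygen", " nitrogen", " helium"]
scores = [option_score(question, o) for o in options]
print("prediction:", options[scores.index(max(scores))].strip())
```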

CoCoNot

In addition to more straightforward safety concerns, AI practitioners should consider the cases when models should not comply with a user’s request. Noncompliance prompts include incomplete, unsupported, indeterminate, and humanizing requests in addition to unsafe requests. CoCoNot is a dataset of queries that should elicit noncompliance, built by curating examples from existing datasets and by synthetically generating new ones with GPT models.

ZebraLogic

LLMs excel at information-seeking and creative writing tasks, and they have improved significantly at math and coding. But how do they perform at logical reasoning? ZebraLogic evaluates the logical reasoning abilities of LLMs via logic grid puzzles, which require multiple higher-order thinking skills. Results show that LLMs still lack several abilities required for logical reasoning, such as analytical thinking, counterfactual thinking, reflective reasoning, structured memorization, and compositional generalization.
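To make "logic grid puzzle" concrete, here is a toy zebra-style puzzle solved by brute force over permutations. The puzzle and its clues are invented for illustration; ZebraLogic's puzzles are larger and generated automatically.

```python
from itertools import permutations

houses = (1, 2, 3)  # positions left to right

# Invented clues: (1) Alice lives directly left of the tea drinker.
#                 (2) Bob does not live in house 1.
#                 (3) The coffee drinker lives in house 3.
#                 (4) Carol drinks coffee.
for names in permutations(["Alice", "Bob", "Carol"]):
    for drinks in permutations(["tea", "coffee", "water"]):
        pos = {name: house for name, house in zip(names, houses)}
        drink_pos = {drink: house for drink, house in zip(drinks, houses)}
        if (pos["Alice"] + 1 == drink_pos["tea"]
                and pos["Bob"] != 1
                and drink_pos["coffee"] == 3
                and pos["Carol"] == drink_pos["coffee"]):
            # Prints the unique assignment: {1: ('Alice', 'water'), 2: ('Bob', 'tea'), 3: ('Carol', 'coffee')}
            print({h: (n, d) for n, d, h in zip(names, drinks, houses)})
```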

ConfAIde

As users share more personal information with AI systems such as personal home assistants, it’s crucial to understand how well those models can protect that sensitive information. The ConfAIde benchmark can be used to identify critical weaknesses in the privacy reasoning capabilities of LLMs.

WildBench

WildBench is an automated evaluation framework designed to benchmark large language models (LLMs) using challenging, real-world user queries. It builds on the WildChat dataset and is an evolution of popular evaluation approaches such as AlpacaEval.
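The core pattern behind this style of evaluation is pairwise LLM-as-judge comparison on real user queries. The sketch below shows that pattern in the abstract; the judge prompt wording is illustrative and `call_judge` is a stand-in for whatever judge model you plug in, not WildBench's actual prompt or judge.

```python
from typing import Callable

JUDGE_TEMPLATE = """You are comparing two assistant responses to a real user query.

Query:
{query}

Response A:
{response_a}

Response B:
{response_b}

Which response better satisfies the user? Answer with exactly "A", "B", or "tie"."""

def pairwise_judgment(query: str, response_a: str, response_b: str,
                      call_judge: Callable[[str], str]) -> str:
    """Ask a judge model which response is better; returns 'A', 'B', or 'tie'."""
    verdict = call_judge(JUDGE_TEMPLATE.format(
        query=query, response_a=response_a, response_b=response_b)).strip()
    return verdict if verdict in {"A", "B", "tie"} else "tie"  # treat malformed output as a tie

# Trivial stand-in judge (always answers "A") just to show the call pattern.
fake_judge = lambda prompt: "A"
print(pairwise_judgment("How do I center a div?", "Use flexbox: ...", "No idea.", fake_judge))
```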

RewardBench

RewardBench is the first benchmark for evaluating reward models for RLHF. RewardBench evaluates the chat, instruction-following, math, reasoning, and safety abilities of reward models. It has helped initiate a faster-moving academic ecosystem studying this part of the alignment process.
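The basic comparison behind this kind of benchmark is simple: a reward model should assign the chosen response a higher score than the rejected one. The sketch below illustrates that check with an example open reward model; the model ID and toy pairs are placeholders, and this is not the official RewardBench harness.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_ID = "OpenAssistant/reward-model-deberta-v3-large-v2"  # example open reward model

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID).eval()

pairs = [  # (prompt, chosen response, rejected response) placeholders
    ("What is 2 + 2?", "2 + 2 equals 4.", "2 + 2 equals 5."),
]

@torch.no_grad()
def reward(prompt: str, response: str) -> float:
    """Scalar reward the model assigns to a prompt/response pair."""
    inputs = tokenizer(prompt, response, return_tensors="pt", truncation=True)
    return model(**inputs).logits[0, 0].item()

correct = sum(reward(p, chosen) > reward(p, rejected) for p, chosen, rejected in pairs)
print(f"accuracy: {correct / len(pairs):.2f}")
```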