Open models - Evaluation frameworks
AI practitioners need to know if their models are performative, safe, biased, or hallucinating — and these are just a few of the measurements that can help us compare notes. Here, you’ll find Ai2’s evaluation frameworks and benchmarks, open and accessible so we can compare like-for-like outcomes.
Featured framework - Ai2 Safety Toolkit
This suite of resources is focused on advancing LLM safety, which will empower researchers and industry professionals to work together on building safer LLMs. The suite includes WildTeaming, an automatic red-teaming framework for identifying and reproducing human-devised attacks, WildJailbreak, a high-quality, large-scale safety training dataset with 262K training examples, and WildGuard, a light-weight, multi-purpose moderation tool for assessing the safety of user-LLM interactions across three safety moderation tasks.
Paloma
Paloma is a benchmark for evaluating open language models across many different domains (ranging from niche artist communities to reddit forums on mental health). We have already evaluated several models to understand how language model performance varies across 585 different domains. We invite you to run our standardized inference code on additional models and submit their results to extend this benchmark!
OLMES
OLMES is a standard for reproducible language model evaluations that is open, practical, completely documented, and can be applied to current leaderboards and evaluation code bases. OLMES is designed to facilitate robust comparisons of model performances, both during model development and when comparing final powerful models, and can be used across a range of model sizes e.g., from 1B to 70B.
In addition to more straightforward safety concerns, AI practitioners should consider the cases when models should not comply with a user’s request. Noncompliance prompts should include incomplete, unsupported, indeterminate, and humanizing requests in addition to unsafe requests. CoCoNot is a dataset that includes a set of queries that should elicit noncompliance either by curating examples from existing datasets or synthetically generating them using GPT models.
LLMs excel at information-seeking and creative writing tasks. They have significantly improved in math and coding too. But how do they perform in logical reasoning? ZebraLogic evaluates the logical reasoning abilities of LLMs via logic grid puzzles, which require multiple high-order thinking skills. Results show that LLMs still lack several abilities required for logical reasoning, like analytical thinking, counterfactual thinking, reflective reasoning, structured memorization, and compositional generalization.
As users share more personal information with AI like their personal home assistants, it’s crucial to understand how well those models can protect that sensitive information. The ConfAIde benchmark can be used to identify critical weaknesses in the privacy reasoning capabilities of LLMs.
WildBench is an automated evaluation framework designed to benchmark large language models (LLMs) using challenging, real-world user queries. It builds off of the WildChat dataset, and is an evolution of popular evaluation styles like AlpacaEval.
RewardBench is the first benchmark for evaluating reward models for RLHF. RewardBench evaluates the chat, instruction following, math, reasoning, and safety abilities of reward models. It has helped initiate a faster moving academic ecosystem studying this part of the alignment process.