OLMES (v0.1) is an open standard for reproducible LLM evaluations, containing exact prompts, curated few-shot examples, and fully specified scoring guidelines across a suite of tasks.
License: ODC-BY

OLMES (Open Language Model Evaluation Standard) is a set of principles and associated tasks for evaluating large language models (LLMs). The current version includes:

  • Standardized formatting of dataset instances
  • Curated, few-shot in-context examples for each task
  • Evaluation of both multiple-choice (MCF) and cloze-form (CF) formulations, using the higher of the two scores
  • Standardized probability normalization schemes for CF
  • Prescribed implementation details:
    • Sampling of 1000 instances for each task that has more than 1500 instances
    • Use test split if labels are available, otherwise use validation split
    • For MMLU use macro average over tasks
    • Restrict each input to a maximum of 2048 tokens
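
The sampling, max-score, and macro-average rules above can be illustrated with a minimal Python sketch. This is not the OLMES reference implementation: the function names (`subsample`, `task_score`, `macro_average`), the fixed random seed, and the example scores are all assumptions made for illustration.

```python
import random

def subsample(instances, threshold=1500, sample_size=1000, seed=0):
    """Keep all instances unless the task has more than `threshold`;
    otherwise draw `sample_size` of them. The seed is an assumption,
    used here only so that the draw is repeatable."""
    instances = list(instances)
    if len(instances) <= threshold:
        return instances
    rng = random.Random(seed)
    return rng.sample(instances, sample_size)

def task_score(mcf_score, cf_score):
    """Use the higher of the multiple-choice (MCF) and cloze-form (CF) scores."""
    return max(mcf_score, cf_score)

def macro_average(per_subtask_scores):
    """Unweighted mean over subtasks (e.g. MMLU subjects): every subtask
    counts equally, regardless of how many instances it has."""
    scores = list(per_subtask_scores.values())
    return sum(scores) / len(scores)

# Hypothetical numbers, for illustration only.
small_task = subsample(range(1200))   # 1200 <= 1500, so nothing is dropped
large_task = subsample(range(5000))   # more than 1500, so 1000 are sampled
best = task_score(mcf_score=0.62, cf_score=0.71)
mmlu = macro_average({"subject_a": 0.5, "subject_b": 1.0})
```

Passing the same seed on every run keeps the sampled subset identical across evaluations, which is the property a reproducible standard needs; the actual seed and sampling procedure should be taken from the OLMES instructions.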

For more details, see instructions here.

The curated few-shot examples can be found in this file: std_fewshot.py.


Yuling Gu, Oyvind Tafjord, Bailey Kuehl, Dany Haddad, Jesse Dodge, and Hannaneh Hajishirzi