OLMES

OLMES (v0.1), released in 2024, is an open standard for reproducible LLM evaluations. It specifies exact prompts, curated few-shot examples, and fully specified scoring guidelines across a suite of tasks.
License: ODC-BY

OLMES (Open Language Model Evaluation Standard) is a set of principles and associated tasks for evaluating large language models (LLMs). The current version includes:

  • Standardized formatting of dataset instances
  • Curated few-shot in-context examples for each task
  • Evaluation of both multiple-choice (MCF) and cloze-form (CF) formulations, reporting the maximum score
  • Standardized probability normalization schemes for CF (see the sketch after this list)
  • Prescribed implementation details:
    • Sampling of 1000 instances per task when a split has more than 1500
    • Use the test split if labels are available, otherwise the validation split
    • For MMLU, use the macro average over its tasks
    • Restrict inputs to a maximum of 2048 tokens
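
As a rough illustration of the CF normalization point above, the sketch below scores answer choices under a few common schemes (raw summed log-probability, per-character, per-token, and PMI against an unconditional prefix). This is an illustrative sketch only, not the OLMES implementation: the function names and the `scheme` labels are hypothetical, and OLMES itself prescribes which normalization applies to each task.

```python
def cf_choice_scores(logprobs, num_chars, num_tokens, uncond_logprobs=None, scheme="per_char"):
    """Normalize cloze-form (CF) answer-choice log-probabilities (illustrative only).

    logprobs        -- summed log-probability of each answer continuation given the prompt
    num_chars       -- character count of each answer continuation
    num_tokens      -- token count of each answer continuation
    uncond_logprobs -- log-probability of each answer given a neutral prefix (used by "pmi")
    scheme          -- "none", "per_char", "per_token", or "pmi" (hypothetical names)
    """
    if scheme == "none":
        return list(logprobs)
    if scheme == "per_char":
        return [lp / max(n, 1) for lp, n in zip(logprobs, num_chars)]
    if scheme == "per_token":
        return [lp / max(n, 1) for lp, n in zip(logprobs, num_tokens)]
    if scheme == "pmi":
        return [lp - ulp for lp, ulp in zip(logprobs, uncond_logprobs)]
    raise ValueError(f"unknown scheme: {scheme}")


def predict(scores):
    """Index of the highest-scoring answer choice."""
    return max(range(len(scores)), key=scores.__getitem__)
```

Per the "maximum score" point above, a task's reported result then takes the better of the two formulations, e.g. `task_score = max(accuracy_mcf, accuracy_cf)`.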

For more details, see instructions here.

The curated few-shot examples can be found in this file: std_fewshot.py.

Citation:

@misc{gu2024olmes,
      title={OLMES: A Standard for Language Model Evaluations}, 
      author={Yuling Gu and Oyvind Tafjord and Bailey Kuehl and Dany Haddad and Jesse Dodge and Hannaneh Hajishirzi},
      year={2024},
      eprint={2406.08446},
      archivePrefix={arXiv}
}

Authors

Yuling Gu, Oyvind Tafjord, Bailey Kuehl, Dany Haddad, Jesse Dodge, and Hannaneh Hajishirzi