OLMES

OLMES (v0.1), released in 2024, is an open standard for reproducible LLM evaluations. It specifies exact prompts, curated few-shot examples, and fully specified scoring guidelines across a suite of tasks.
License: ODC-BY

OLMES (Open Language Model Evaluation Standard) is a set of principles and associated tasks for evaluating large language models (LLMs). The current version includes:

  • Standardized formatting of dataset instances
  • Curated few-shot in-context examples for each task
  • Evaluation of both multiple-choice (MCF) and cloze-form (CF) formulations, reporting the maximum score
  • Standardized probability normalization schemes for CF (see the sketch after this list)
  • Prescribed implementation details:
    • Sampling of 1000 instances per task when a split has more than 1500
    • Use the test split if labels are available, otherwise the validation split
    • For MMLU, use the macro average over its tasks
    • Restrict inputs to a maximum of 2048 tokens
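
As a rough illustration of the CF normalization point above, the sketch below scores answer choices under a few common schemes (raw summed log-probability, per-character, per-token, and PMI against an unconditional prefix). This is an illustrative sketch only, not the OLMES implementation: the function names and the `scheme` labels are hypothetical, and OLMES itself prescribes which normalization applies to each task.

```python
def cf_choice_scores(logprobs, num_chars, num_tokens, uncond_logprobs=None, scheme="per_char"):
    """Normalize cloze-form (CF) answer-choice log-probabilities (illustrative only).

    logprobs        -- summed log-probability of each answer continuation given the prompt
    num_chars       -- character count of each answer continuation
    num_tokens      -- token count of each answer continuation
    uncond_logprobs -- log-probability of each answer given a neutral prefix (used by "pmi")
    scheme          -- "none", "per_char", "per_token", or "pmi" (hypothetical names)
    """
    if scheme == "none":
        return list(logprobs)
    if scheme == "per_char":
        return [lp / max(n, 1) for lp, n in zip(logprobs, num_chars)]
    if scheme == "per_token":
        return [lp / max(n, 1) for lp, n in zip(logprobs, num_tokens)]
    if scheme == "pmi":
        return [lp - ulp for lp, ulp in zip(logprobs, uncond_logprobs)]
    raise ValueError(f"unknown scheme: {scheme}")


def predict(scores):
    """Index of the highest-scoring answer choice."""
    return max(range(len(scores)), key=scores.__getitem__)
```

Per the "maximum score" point above, a task's reported result then takes the better of the two formulations, e.g. `task_score = max(accuracy_mcf, accuracy_cf)`.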

For more details, see instructions here.

The curated few-shot examples can be found in this file: std_fewshot.py.

Citation:

@misc{gu2024olmes,
      title={OLMES: A Standard for Language Model Evaluations}, 
      author={Yuling Gu and Oyvind Tafjord and Bailey Kuehl and Dany Haddad and Jesse Dodge and Hannaneh Hajishirzi},
      year={2024},
      eprint={2406.08446},
      archivePrefix={arXiv}
}

Authors

Yuling Gu, Oyvind Tafjord, Bailey Kuehl, Dany Haddad, Jesse Dodge, and Hannaneh Hajishirzi