
OLMoASR: A series of open speech recognition models

August 28, 2025

Ai2


We’re pleased to release OLMoASR, a family of completely open automatic speech recognition (ASR) models trained from scratch on a curated, large-scale dataset. 

Designed to match or surpass the performance of models such as OpenAI’s Whisper, which has seen wide adoption thanks to its strong in-the-wild performance but was trained on undisclosed data, OLMoASR aims to advance the accessibility of zero-shot ASR. 

Many ASR models are trained on undisclosed data, making them unreproducible, challenging to analyze, and difficult to improve. By contrast, OLMoASR embraces openness. With a new 3-million-hour weakly supervised audio-text data pool and rigorous curation techniques, our OLMoASR family achieves competitive zero-shot performance—all with fully open weights, code, and training data.

Benchmarking OLMoASR

We trained six initial OLMoASR models: 

  • OLMoASR-tiny.en, a 39-million-parameter ASR model
  • OLMoASR-base.en, a 74-million-parameter ASR model
  • OLMoASR-small.en, a 244-million-parameter ASR model
  • OLMoASR-medium.en, a 769-million-parameter ASR model
  • OLMoASR-large.en-v1, a 1.5-billion-parameter ASR model trained on 440,000 hours of audio
  • OLMoASR-large.en-v2, a 1.5-billion-parameter ASR model trained on 680,000 hours of audio

To assess OLMoASR’s robustness, we evaluated the models across 21 diverse test sets – 14 short-form and 7 long-form – none of which were seen during training. These include audiobooks, calls, meetings, lectures, and more, capturing a wide range of accents, speech categories, and durations.
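All of the comparisons below use word error rate (WER): the word-level edit distance between the reference transcript and the model’s hypothesis, normalized by the number of reference words. A minimal sketch of the metric:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + insertions + deletions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = minimum edits to turn ref[:i] into hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # delete all remaining reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # insert all remaining hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution or match
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# One dropped word out of six reference words -> WER of 1/6
print(wer("the cat sat on the mat", "the cat sat on mat"))
```

In practice, reported WER also depends on a text normalizer (casing, punctuation, number formats) applied to both reference and hypothesis before scoring; Whisper-style evaluations apply such a normalizer.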

Evaluation results show our OLMoASR models match or exceed Whisper’s zero-shot performance across most scales:

  • Compared to Whisper’s largest English-only ASR model, OLMoASR-medium.en reaches 12.8% WER (short-form) and 11.0% WER (long-form), on par with Whisper-medium.en’s 12.4% (short-form) and 10.5% (long-form) WER at the same parameter count.
  • At the largest scale, OLMoASR-large.en-v1, trained on 440K hours of audio (per epoch), achieves 13.0% WER (short-form) versus 12.2% WER for Whisper-large-v1 (trained on 680K multilingual hours). Re-trained on a matched 680K hours (per epoch), OLMoASR-large.en-v2 narrows the gap to around 0.4 percentage points.
  • OLMoASR-tiny.en and OLMoASR-base.en approach the short-form WER – and surpass the long-form WER – of their similarly-sized Whisper counterparts. Meanwhile, OLMoASR-small.en roughly matches Whisper-small.en on short-form and long-form WER. 

These results showcase OLMoASR’s efficiency and robustness even at smaller scales.

Open, rigorous, and robust ASR

Recent open ASR models like OWSM, Distil-Whisper, and NVIDIA’s Parakeet and Canary have pushed the field forward, but each falls short in scale, curation quality, or transparency – or isn’t zero-shot. OLMoASR is a step change: not only are the models trained on data of Whisper-like scale, but the entire training pipeline – including filtering heuristics and evaluation code – is open source.

Our guiding insight is this: data quality is as important as scale. We start with OLMoASR-Pool, a massive collection of 3M hours of English audio and 17M transcripts collected from the public web. Then, using a multi-stage filtering pipeline involving audio-text language alignment, text-based heuristics, and fuzzy deduplication, we distill this pool into OLMoASR-Mix, a curated 1M-hour dataset of high-quality audio-text pairs.
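The post doesn’t spell out the fuzzy-deduplication method here, so as a rough illustration, one common technique is near-duplicate detection via character n-gram Jaccard similarity. The n-gram size and threshold below are assumptions for demonstration, not the pipeline’s actual values:

```python
def ngrams(text: str, n: int = 5) -> set:
    """Character n-grams over whitespace- and case-normalized text."""
    text = " ".join(text.lower().split())
    return {text[i:i + n] for i in range(max(len(text) - n + 1, 1))}

def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two n-gram sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def dedup(transcripts: list, threshold: float = 0.8) -> list:
    """Greedily keep transcripts whose similarity to all kept ones is below threshold."""
    kept, kept_grams = [], []
    for t in transcripts:
        g = ngrams(t)
        if all(jaccard(g, kg) < threshold for kg in kept_grams):
            kept.append(t)
            kept_grams.append(g)
    return kept
```

At 3M-hour scale, an exact pairwise comparison like this is infeasible; production pipelines typically approximate it with MinHash or locality-sensitive hashing.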

This data-first approach allows us to isolate the effects of data quality on generalization without confounding factors. Architecture, tokenizer, and training recipes are held constant across experiments, allowing us to precisely measure how curation impacts performance.

Data curation: The key to strong zero-shot generalization

Our main experiment shows that data filtering yields significant gains across all model scales, underscoring the importance of dataset curation. OLMoASR prioritizes quality over raw quantity, using text-based heuristics to build a dataset that supports strong robustness and zero-shot generalization. 

Starting with OLMoASR-Pool, we apply a series of filters:

  • Audio-text language alignment to remove language-mismatched pairs
  • Removal of all-uppercase transcripts and transcripts with repeating lines, both common markers of noisy machine-generated transcripts
  • Discarding transcripts or segments whose WER against an automatically generated counterpart is too high, eliminating unfaithful or poorly aligned transcripts
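As a rough sketch, the text-based heuristics above might look like the following; the exact rules and thresholds used to build OLMoASR-Mix aren’t reproduced here, so treat these as illustrative assumptions:

```python
from collections import Counter

def is_all_uppercase(transcript: str) -> bool:
    # All-caps text is a common marker of noisy machine-generated captions.
    letters = [c for c in transcript if c.isalpha()]
    return bool(letters) and all(c.isupper() for c in letters)

def has_repeating_lines(transcript: str, max_repeats: int = 2) -> bool:
    # Repeated caption lines also tend to indicate auto-generated transcripts.
    counts = Counter(line.strip() for line in transcript.splitlines() if line.strip())
    return any(n > max_repeats for n in counts.values())

def keep(transcript: str) -> bool:
    """True if the transcript survives both text-based heuristics."""
    return not is_all_uppercase(transcript) and not has_repeating_lines(transcript)
```

The third filter, comparing each transcript’s WER against an automatically generated counterpart, additionally requires running a seed ASR model over the audio, so it is omitted from this sketch.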

Our strategy is simple but effective. We demonstrate this through rigorous experiments at each filtering step, measuring how much each individual stage contributes to the overall gains.

Fully open and reproducible

Unlike proprietary systems, everything underpinning OLMoASR is open. This includes:

  • The OLMoASR-Pool and OLMoASR-Mix datasets
  • The data processing and filtering code used for data curation
  • The model weights and training pipeline
  • The evaluation code and benchmark scripts

We believe this transparency is essential for the speech community to advance toward truly robust and generalizable ASR systems. By opening every layer of the stack, OLMoASR allows researchers to ask new questions, reproduce results, and push the broader field forward.

OLMoASR isn’t just a new series of models—it’s a platform for open research into the role of data in speech recognition. And with everything released – from models to metrics – it’s ready for you to build on.
