olmOCR: Efficient PDF text extraction with vision language models
February 25, 2025
Ai2
From pretraining to inference, language models (LMs) operate on plain text data. Whether during training on trillions of tokens, or when serving data-intensive AI applications, the quality of this text matters significantly. Noisy text leads to training instabilities and worse model performance, or poor outputs when completing users' requests.
However, not all the data LMs use is available in easy-to-parse formats such as web pages. In fact, for many domains, valuable information is stored in electronic document files, like PDFs. These formats present unique challenges because they're designed to render content on fixed-size pages rather than to preserve logical text structure. Take PDFs, for example: the format stores text as a sequence of binary character encodings, alongside their position and formatting on a page. This representation, while efficient, makes it challenging to recover text units such as headings, paragraphs, tables, and equations, and to arrange them in the correct reading order.
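To see why reading order is hard to recover, consider a minimal sketch of how a PDF page looks once parsed: a bag of text fragments with coordinates. The glyph tuples below are invented for illustration; they show how a naive top-to-bottom, left-to-right sort interleaves the columns of a two-column layout instead of reading each column fully.

```python
# Illustrative only: a PDF stores text fragments with positions, not logical order.
# Each tuple is (x, y, text); y grows upward in PDF coordinate space.
glyphs = [
    (50, 700, "Left column, line 1"),
    (300, 700, "Right column, line 1"),
    (50, 680, "Left column, line 2"),
    (300, 680, "Right column, line 2"),
]

# Naive extraction: sort top-to-bottom, then left-to-right.
naive = [text for _, _, text in sorted(glyphs, key=lambda g: (-g[1], g[0]))]

# The two columns come out interleaved rather than read one after the other:
print(naive)
```

A correct extractor has to infer the layout (here, two columns) before ordering the text, which is exactly what position data alone does not tell you.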
To help with processing electronic documents, we introduce olmOCR, a high-performance toolkit designed to convert PDFs and document images into clean, structured plain text. What sets olmOCR apart?
- Performance: we fine-tune olmOCR on 250,000 pages sampled from a diverse set of PDFs. Some are born digital, while others are scanned copies of public domain books. This ensures that olmOCR can accurately extract text from a wide range of documents.
- Cost effective: the olmOCR toolkit can process one million PDF pages for about $190, roughly 1/32 of what you'd pay to process the same number of pages using GPT-4o APIs in batch mode.
- Markdown output: olmOCR outputs text in Markdown format, which is easy to parse and process. It can handle equations, tables, and handwriting, all in the correct reading order even for the most complex, multi-column document layouts.
- Batteries included: olmOCR is a fully optimized pipeline that works with both SGLang and vLLM inference engines. It scales efficiently from one to hundreds of GPUs and includes heuristics to handle common parsing failures and metadata errors.
- Fully open-source: olmOCR is built on top of Qwen2-VL-7B-Instruct. We release all components of the toolkit: model weights, fine-tuning dataset, training and inference code.
See how olmOCR compares to other leading document extraction tools and learn more about how we built it. Once you are ready to try it out, visit our GitHub repository to use olmOCR in your own projects.
Building olmOCR
To obtain high-quality data for training olmOCR, we develop a technique called document anchoring. This method leverages any text and metadata already present in a PDF file to improve the quality of the extracted text.
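The core idea can be sketched as follows: serialize whatever text and positions the PDF's own text layer contains, and prepend that as an "anchor" to the prompt sent to the vision language model alongside the page image. The function names, anchor format, and length cap below are illustrative assumptions, not olmOCR's actual prompt format.

```python
# Hedged sketch of the document-anchoring idea. The format "[x,y] text" and the
# max_len budget are invented for illustration.
def build_anchor(spans, max_len=4000):
    """spans: list of (x, y, text) tuples taken from the PDF's text layer."""
    lines = [f"[{x:.0f},{y:.0f}] {text}" for x, y, text in spans]
    anchor = "\n".join(lines)
    return anchor[:max_len]  # cap length so the prompt stays within budget

def build_prompt(spans):
    anchor = build_anchor(spans)
    return (
        "Below is the raw text layer of a PDF page with approximate positions:\n"
        f"{anchor}\n"
        "Using both this text and the attached page image, transcribe the page "
        "as clean Markdown in natural reading order."
    )
```

The anchor grounds the model's transcription: even when the rendered image is low quality, the model can copy exact strings (numbers, equations, names) from the embedded text rather than guessing them from pixels.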
Using document anchoring, we label 250,000 pages using GPT-4o. We use a combination of publicly accessible PDFs crawled from the web and public domain books scanned by the Internet Archive. The dataset is diverse, with 60% academic papers, 12% brochures, 11% legal documents, 6% diagrams, 5% slideshows, and 4% other document types.
For the model itself, we fine-tuned a Qwen2-VL-7B-Instruct checkpoint. We carefully optimized our inference pipeline for large-scale batch processing using SGLang, enabling olmOCR to convert one million PDF pages for just $190, about 1/32nd the cost of using GPT-4o APIs. Our results demonstrate not only significant cost savings but also superior performance in human evaluations compared to other popular OCR tools.
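A quick back-of-envelope check of those numbers: the GPT-4o batch figure below is derived from the post's own 1/32 ratio, not from quoted API pricing.

```python
# Cost arithmetic implied by the post's figures (illustrative, not a quote).
olmocr_cost_per_million = 190.0
gpt4o_batch_cost_per_million = olmocr_cost_per_million * 32  # ~= $6,080

cost_per_page = olmocr_cost_per_million / 1_000_000
print(f"olmOCR: ${cost_per_page:.6f} per page")  # $0.000190 per page
```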
We evaluated olmOCR by comparing its output against other popular PDF extraction tools: Marker, MinerU, and GOT-OCR 2.0. We sampled 2,017 PDFs and collected pairwise judgments from 11 researchers, gathering 452 meaningful comparisons and calculating Elo ratings to quantify performance. olmOCR achieves an Elo score above 1800, significantly outperforming all competitors. In head-to-head comparisons, olmOCR was preferred in 61.3% of comparisons against Marker, 58.6% against GOT-OCR, and an impressive 71.4% against MinerU, demonstrating its superior ability to produce clean, well-structured text.
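Elo ratings turn a stream of pairwise A/B judgments into a single score per tool. Here is a minimal sketch of the standard Elo update; the K-factor and starting rating are common defaults, not necessarily the values used in the olmOCR report.

```python
# Standard Elo update for one pairwise judgment. K=32 and a 1500 starting
# rating are illustrative defaults.
def elo_update(r_winner, r_loser, k=32):
    # Expected win probability for the winner given the rating gap.
    expected_win = 1 / (1 + 10 ** ((r_loser - r_winner) / 400))
    delta = k * (1 - expected_win)
    return r_winner + delta, r_loser - delta

ratings = {"olmOCR": 1500.0, "Marker": 1500.0}
# Feed each judgment through the update; here olmOCR wins one comparison.
ratings["olmOCR"], ratings["Marker"] = elo_update(ratings["olmOCR"], ratings["Marker"])
```

Iterating this over all 452 judgments (typically in several shuffled passes to reduce order sensitivity) yields the final leaderboard scores.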
You can see more details and other evaluations in our technical report.
Accessing olmOCR
The first olmOCR release includes a demo, model weights, our fine-tuning dataset, a brief technical report, and most importantly, an efficient inference pipeline.
Visit our GitHub repository to install olmOCR and explore the documentation. Then, on a machine with GPUs, simply run:
python -m olmocr.pipeline ./localworkspace --pdfs tests/gnarly_pdfs/horribleocr.pdf
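Once the pipeline finishes, the extracted text lands in the workspace directory. The sketch below assumes the results are JSON Lines files with a "text" field under a results/ subdirectory; check the repository documentation for the exact layout and schema.

```python
# Hedged sketch for collecting pipeline output. The results/ path and the
# "text" field are assumptions about the workspace layout, not a documented API.
import glob
import json
import os

def load_texts(workspace="./localworkspace"):
    """Gather extracted page text from every JSONL file in the workspace."""
    texts = []
    for path in sorted(glob.glob(os.path.join(workspace, "results", "*.jsonl"))):
        with open(path) as f:
            for line in f:
                record = json.loads(line)
                texts.append(record["text"])
    return texts
```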
We hope to soon release additional quantitative benchmarks to help develop better PDF extraction models and evaluate their performance.