Clinical NLP using Olmo
“Having truly open-science models like Olmo is indispensable.” — Byron Wallace, Professor of Computer Science at Northeastern University
In healthcare, small modeling choices can have big consequences. If a model internally represents a patient’s gender or race in ways that influence the text it generates—or the risk scores it produces—clinicians, researchers, and regulators need to know.
A team at Northeastern University – Hiba Ahsan, Arnab Sen Sharma, Silvio Amir, David Bau, and Byron Wallace – wanted to find out if that was happening, and they used Olmo to do it.
They began with a high-stakes question: Can we pinpoint where an LLM encodes sociodemographic attributes and see how those signals affect clinical tasks? Because this requires access to a model’s weights and activations, it’s only practical with a truly open model—hence the choice of Olmo.
The team downloaded the Olmo models and ran them locally on a single NVIDIA A100 GPU. Their experiments used a custom dataset derived from real clinical notes.
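To make that kind of access concrete, here is a minimal sketch of loading an Olmo checkpoint through Hugging Face transformers and reading out every layer's activations. This is illustrative only, not the team's code; the checkpoint ID and the clinical-style prompt are assumptions.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "allenai/OLMo-2-1124-7B"  # assumed checkpoint ID; any open Olmo release works
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,   # fits comfortably on a single A100
    device_map="auto",            # requires the `accelerate` package
)

# A synthetic clinical-style prompt (illustrative, not real patient data).
prompt = "Patient presents with chest pain and shortness of breath."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Because the model is fully open, every layer's activations are inspectable.
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# hidden_states is a tuple of (num_layers + 1) tensors of shape
# [batch, seq_len, hidden_dim]: the embedding output plus each block's output.
for layer_idx, h in enumerate(out.hidden_states):
    print(f"layer {layer_idx}: {tuple(h.shape)}")
```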
“What I liked about working with the Olmo family of models was that we could study if our findings generalized across scale in a compute-constrained academic setting,” Ahsan, a PhD student at the Khoury College of Computer Sciences at Northeastern University, says.
The researchers found that – when asked to write vignettes for particular conditions – Olmo over-represented certain genders, a common problem in language modeling. But because the team could localize where gender was encoded in the model, they could make small, targeted edits to those components and flip the model's biases.
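The paper details its own localization and editing procedure; purely as an illustration of the localize-then-edit idea, the sketch below derives a crude "gender direction" from contrast prompts (difference-of-means steering, a generic technique, not necessarily the paper's edit) and projects it out of one layer's residual stream with a forward hook. It continues from the previous snippet, and the layer index, prompts, and module path are all assumptions.

```python
import torch

LAYER = 16  # hypothetical: the layer where gender is most decodable

def last_token_activation(text: str) -> torch.Tensor:
    """Output of decoder block LAYER at the final token of `text`.
    (hidden_states[LAYER + 1] is block LAYER's output; index 0 is embeddings.)"""
    ids = tokenizer(text, return_tensors="pt").to(model.device)
    with torch.no_grad():
        hs = model(**ids, output_hidden_states=True).hidden_states[LAYER + 1]
    return hs[0, -1].float()

# Contrast prompts that differ only in gender; their difference of means
# gives a crude direction along which the attribute is encoded.
male = last_token_activation("The patient is a 45-year-old man.")
female = last_token_activation("The patient is a 45-year-old woman.")
direction = male - female
direction = direction / direction.norm()

def ablate_gender_direction(module, args, output):
    h = output[0] if isinstance(output, tuple) else output
    d = direction.to(dtype=h.dtype, device=h.device)
    # Project the gender direction out of the residual stream at this layer.
    h = h - (h @ d).unsqueeze(-1) * d
    return (h,) + output[1:] if isinstance(output, tuple) else h

# Apply the targeted edit during generation, then remove it.
handle = model.model.layers[LAYER].register_forward_hook(ablate_gender_direction)
try:
    ids = tokenizer(
        "Write a brief clinical vignette about a patient with sarcoidosis.",
        return_tensors="pt",
    ).to(model.device)
    print(tokenizer.decode(model.generate(**ids, max_new_tokens=60)[0]))
finally:
    handle.remove()
```

Adding the direction back with the opposite sign, rather than removing it, would push the model the other way; that is the sense in which small, targeted edits can flip a bias.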
The same interventions shifted the model’s outputs on clinically relevant tasks, including whether patients were judged to be at higher risk of depression.
“This work required an open model so that we could access weights and activations,” Wallace, a computer science professor at Northeastern, says. “Having truly open-science models is indispensable.”
That’s Olmo’s differentiation in a sentence: Many models are powerful, and some release their weights, but few provide the ingredients to explain and fix their behavior. Olmo’s weights are open, and its pre-training corpus (Dolma) and tooling are openly published, so researchers can study the model’s training procedure, internal mechanisms, and behavior.
If trust is the metric, openness is the method. Olmo turns evaluation into explanation—and explanation into improvement.
The Northeastern study, “Elucidating Mechanisms of Demographic Bias in LLMs for Healthcare,” will be presented at EMNLP 2025 in November.