Research - Papers

Explore a selection of our published work on a variety of key research challenges in AI.

Evaluating n-Gram Novelty of Language Models Using Rusty-DAWG

William Merrill, Noah A. Smith, Yanai Elazar

2024

EMNLP

How novel are texts generated by language models (LMs) relative to their training corpora? In this work, we investigate the extent to which modern LMs generate n-grams from their training data,…

Measuring and Improving Attentiveness to Partial Inputs with Counterfactuals

Yanai Elazar, Bhargavi Paranjape, Hao Peng, Noah A. Smith

2024

EMNLP Findings

The inevitable appearance of spurious correlations in training datasets hurts the generalization of NLP models on unseen data. Previous work has found that datasets with paired inputs are prone to…

Merge to Learn: Efficiently Adding Skills to Language Models with Model Merging

Jacob Daniel Morrison, Noah A. Smith, Hanna Hajishirzi, Pradeep Dasigi

2024

EMNLP Findings

Adapting general-purpose language models to new skills is currently an expensive process that must be repeated as new instruction datasets targeting new skills are created, or can cause the models…

Scalable Data Ablation Approximations for Language Models through Modular Training and Merging

Clara Na, Ian Magnusson, Ananya Harsh Jha, Pradeep Dasigi

2024

EMNLP

Training data compositions for Large Language Models (LLMs) can significantly affect their downstream performance. However, a thorough data ablation study exploring large sets of candidate data…

ComPO: Community Preferences for Language Model Personalization

Sachin Kumar, Chan Young Park, Yulia Tsvetkov, Hanna Hajishirzi

2024

arXiv

Conventional algorithms for training language models (LMs) with human feedback rely on preferences that are assumed to account for an "average" user, disregarding subjectivity and finer-grained…

Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Multimodal Models

Matt Deitke, Christopher Clark, Sangho Lee, Aniruddha Kembhavi

2024

arXiv

Today's most advanced multimodal models remain proprietary. The strongest open-weight models rely heavily on synthetic data from proprietary VLMs to achieve good performance, effectively distilling…

OLMoE: Open Mixture-of-Experts Language Models

Niklas Muennighoff, Luca Soldaini, Dirk Groeneveld, Hannaneh Hajishirzi

2024

arXiv

We introduce OLMoE, a fully open, state-of-the-art language model leveraging sparse Mixture-of-Experts (MoE). OLMoE-1B-7B has 7 billion (B) parameters but uses only 1B per input token. We pretrain…

Data Contamination Report from the 2024 CONDA Shared Task

Oscar Sainz, Iker García-Ferrero, Alon Jacovi, Jinglin Yang

2024

arXiv

The 1st Workshop on Data Contamination (CONDA 2024) focuses on all relevant aspects of data contamination in natural language processing, where data contamination is understood as situations where…

Evaluating In-Context Learning of Libraries for Code Generation

Arkil Patel, Siva Reddy, Dzmitry Bahdanau, Pradeep Dasigi

2024

NAACL

Contemporary Large Language Models (LLMs) exhibit a high degree of code generation and comprehension capability. A particularly promising area is their ability to interpret code modules from…

The Bias Amplification Paradox in Text-to-Image Generation

P. Seshadri, Sameer Singh, Yanai Elazar

2024

NAACL

Bias amplification is a phenomenon in which models increase imbalances present in the training data. In this paper, we study bias amplification in the text-to-image domain using Stable Diffusion by…
