Ai2

Research - Papers

Explore a selection of our published work on a variety of key research challenges in AI.


Holodeck: Language Guided Generation of 3D Embodied AI Environments

Yue Yang, Fan-Yun Sun, Luca Weihs, Christopher Clark
2024
CVPR

3D simulated environments play a critical role in Embodied AI, but their creation requires expertise and extensive manual effort, restricting their diversity and scope. To mitigate this limitation,… 

The One RING: a Robotic Indoor Navigation Generalist

Ainaz Eftekhar, Luca Weihs, Rose Hendrix, Kuo-Hao Zeng
2024
arXiv

Modern robots vary significantly in shape, size, and sensor configurations used to perceive and interact with their environments. However, most navigation policies are embodiment-specific; a policy… 

Task Me Anything

Jieyu Zhang, Weikai Huang, Zixian Ma, Ranjay Krishna
2024
NeurIPS

Benchmarks for large multimodal language models (MLMs) now serve to assess the general capabilities of models simultaneously, rather than to evaluate a single specific capability. As a result, when a… 

m&m's: A Benchmark to Evaluate Tool-Use for multi-step multi-modal Tasks

Zixian Ma, Weikai Huang, Jieyu Zhang, Ranjay Krishna
2024
ECCV

Real-world multi-modal problems are rarely solved by a single machine learning model and often require multi-step computational plans that involve stitching together several models. Tool-augmented LLMs hold… 

FLaRe: Achieving Masterful and Adaptive Robot Policies with Large-Scale Reinforcement Learning Fine-Tuning

Jiaheng Hu, Rose Hendrix, Ali Farhadi, Kiana Ehsani
2024
ICRA

In recent years, the Robotics field has initiated several efforts toward building generalist robot policies through large-scale multi-task Behavior Cloning. However, direct deployments of these… 

Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Multimodal Models

Matt Deitke, Christopher Clark, Sangho Lee, Aniruddha Kembhavi
2024
arXiv

Today's most advanced multimodal models remain proprietary. The strongest open-weight models rely heavily on synthetic data from proprietary VLMs to achieve good performance, effectively distilling… 

PoliFormer: Scaling On-Policy RL with Transformers Results in Masterful Navigators

Kuo-Hao Zeng, Zichen Zhang, Kiana Ehsani, Luca Weihs
2024
CoRL

We present PoliFormer (Policy Transformer), an RGB-only indoor navigation agent trained end-to-end with reinforcement learning at scale that generalizes to the real-world without adaptation despite… 

Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action

Jiasen Lu*, Christopher Clark*, Sangho Lee*, Aniruddha Kembhavi
2024
CVPR

We present Unified-IO 2, the first autoregressive multimodal model that is capable of understanding and generating images, text, audio, and action. To unify different modalities, we tokenize inputs… 

Universal Visual Decomposer: Long-Horizon Manipulation Made Easy

Zichen Zhang, Yunshuang Li, Osbert Bastani, Luca Weihs
2024
ICRA

Real-world robotic tasks stretch over extended horizons and encompass multiple stages. Learning long-horizon manipulation tasks, however, is a long-standing challenge, and demands decomposing the… 

Selective Visual Representations Improve Convergence and Generalization for Embodied-AI

Ainaz Eftekhar, Kuo-Hao Zeng, Jiafei Duan, Ranjay Krishna
2024
ICLR

Embodied AI models often employ off-the-shelf vision backbones like CLIP to encode their visual observations. Although such general-purpose representations encode rich syntactic and semantic… 
