Research - Papers

Explore a selection of our published work on a variety of key research challenges in AI.

The One RING: a Robotic Indoor Navigation Generalist

Ainaz Eftekhar, Luca Weihs, Rose Hendrix, Kuo-Hao Zeng

2024

arXiv

Modern robots vary significantly in shape, size, and sensor configurations used to perceive and interact with their environments. However, most navigation policies are embodiment-specific; a policy…

Task Me Anything

Jieyu Zhang, Weikai Huang, Zixian Ma, Ranjay Krishna

2024

NeurIPS

Benchmarks for large multimodal language models (MLMs) now serve to simultaneously assess the general capabilities of models instead of evaluating for a specific capability. As a result, when a…

m&m's: A Benchmark to Evaluate Tool-Use for multi-step multi-modal Tasks

Zixian Ma, Weikai Huang, Jieyu Zhang, Ranjay Krishna

2024

ECCV

Real-world multi-modal problems are rarely solved by a single machine learning model, and often require multi-step computational plans that involve stitching several models. Tool-augmented LLMs hold…

FLaRe: Achieving Masterful and Adaptive Robot Policies with Large-Scale Reinforcement Learning Fine-Tuning

Jiaheng Hu, Rose Hendrix, Ali Farhadi, Kiana Ehsani

2024

ICRA

In recent years, the Robotics field has initiated several efforts toward building generalist robot policies through large-scale multi-task Behavior Cloning. However, direct deployments of these…

Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Multimodal Models

Matt Deitke, Christopher Clark, Sangho Lee, Aniruddha Kembhavi

2024

arXiv

Today's most advanced multimodal models remain proprietary. The strongest open-weight models rely heavily on synthetic data from proprietary VLMs to achieve good performance, effectively distilling…

PoliFormer: Scaling On-Policy RL with Transformers Results in Masterful Navigators

Kuo-Hao Zeng, Zichen Zhang, Kiana Ehsani, Luca Weihs

2024

CoRL

We present PoliFormer (Policy Transformer), an RGB-only indoor navigation agent trained end-to-end with reinforcement learning at scale that generalizes to the real-world without adaptation despite…

Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action

Jiasen Lu*, Christopher Clark*, Sangho Lee*, Aniruddha Kembhavi

2024

CVPR

We present Unified-IO 2, the first autoregressive multimodal model that is capable of understanding and generating images, text, audio, and action. To unify different modalities, we tokenize inputs…

Universal Visual Decomposer: Long-Horizon Manipulation Made Easy

Zichen Zhang, Yunshuang Li, Osbert Bastani, Luca Weihs

2024

IEEE International Conference on Robotics and Automation

Real-world robotic tasks stretch over extended horizons and encompass multiple stages. Learning long-horizon manipulation tasks, however, is a long-standing challenge, and demands decomposing the…

Selective Visual Representations Improve Convergence and Generalization for Embodied-AI

Ainaz Eftekhar, Kuo-Hao Zeng, Jiafei Duan, Ranjay Krishna

2024

ICLR • Proceedings

Embodied AI models often employ off the shelf vision backbones like CLIP to encode their visual observations. Although such general purpose representations encode rich syntactic and semantic…

Harmonic Mobile Manipulation

Ruihan Yang, Yejin Kim, Aniruddha Kembhavi, Kiana Ehsani

2023

IROS

Recent advancements in robotics have enabled robots to navigate complex scenes or manipulate diverse objects independently. However, robots are still impotent in many household tasks requiring…
