Research - Papers
Explore a selection of our published work on a variety of key research challenges in AI.
When Learning Is Out of Reach, Reset: Generalization in Autonomous Visuomotor Reinforcement Learning
Episodic training, where an agent's environment is reset to some initial condition after every success or failure, is the de facto standard when training embodied reinforcement learning (RL) agents.…
Objaverse: A Universe of Annotated 3D Objects
Massive data corpora like WebText, Wikipedia, Conceptual Captions, WebImageText, and LAION have propelled recent dramatic progress in AI. Large neural models trained on such datasets produce…
Ask4Help: Learning to Leverage an Expert for Embodied Tasks
Embodied AI agents continue to become more capable every year with the advent of new models, environments, and benchmarks, but are still far away from being performant and reliable enough to be…
ProcTHOR: Large-Scale Embodied AI Using Procedural Generation
Massive datasets and high-capacity models have driven many recent advancements in computer vision and natural language understanding. This work presents a platform to enable similar success stories…
Webly Supervised Concept Expansion for General Purpose Vision Models
General purpose vision (GPV) systems [25] are models that are designed to solve a wide array of visual tasks without requiring architectural changes. Today, GPVs primarily learn both skills and…
Towards Disturbance-Free Visual Mobile Manipulation
Deep reinforcement learning has shown promising results on an abundance of robotic tasks in simulation, including visual navigation and manipulation. Prior work generally aims to build embodied…
Benchmarking Progress to Infant-Level Physical Reasoning in AI
To what extent do modern AI systems comprehend the physical world? We introduce the open-access Infant-Level Physical Reasoning Benchmark (InfLevel) to gain insight into this question. We evaluate…
I can’t believe there’s no images!: Learning Visual Tasks Using Only Language Supervision
Many high-level skills that are required for computer vision tasks, such as parsing questions, comparing and contrasting semantics, and writing descriptions, are also required in other domains such…
Simple but Effective: CLIP Embeddings for Embodied AI
Contrastive language image pretraining (CLIP) encoders have been shown to be beneficial for a range of visual tasks from classification and detection to captioning and image manipulation. We…
MERLOT Reserve: Neural Script Knowledge through Vision and Language and Sound
As humans, we navigate a multimodal world, building a holistic understanding from all our senses. We introduce MERLOT Reserve, a model that…