Research - Papers
Explore a selection of our published work on a variety of key research challenges in AI.
Objaverse: A Universe of Annotated 3D Objects
Massive data corpora like WebText, Wikipedia, Conceptual Captions, WebImageText, and LAION have propelled recent dramatic progress in AI. Large neural models trained on such datasets produce…
Ask4Help: Learning to Leverage an Expert for Embodied Tasks
Embodied AI agents continue to become more capable every year with the advent of new models, environments, and benchmarks, but are still far away from being performant and reliable enough to be…
ProcTHOR: Large-Scale Embodied AI Using Procedural Generation
Massive datasets and high-capacity models have driven many recent advancements in computer vision and natural language understanding. This work presents a platform to enable similar success stories…
Webly Supervised Concept Expansion for General Purpose Vision Models
General purpose vision (GPV) systems [25] are models that are designed to solve a wide array of visual tasks without requiring architectural changes. Today, GPVs primarily learn both skills and…
Towards Disturbance-Free Visual Mobile Manipulation
Deep reinforcement learning has shown promising results on an abundance of robotic tasks in simulation, including visual navigation and manipulation. Prior work generally aims to build embodied…
Benchmarking Progress to Infant-Level Physical Reasoning in AI
To what extent do modern AI systems comprehend the physical world? We introduce the open-access Infant-Level Physical Reasoning Benchmark (InfLevel) to gain insight into this question. We evaluate…
I can’t believe there’s no images! Learning Visual Tasks Using Only Language Supervision
Many high-level skills that are required for computer vision tasks, such as parsing questions, comparing and contrasting semantics, and writing descriptions, are also required in other domains such…
Simple but Effective: CLIP Embeddings for Embodied AI
Contrastive language image pretraining (CLIP) encoders have been shown to be beneficial for a range of visual tasks from classification and detection to captioning and image manipulation. We…
MERLOT Reserve: Neural Script Knowledge through Vision and Language and Sound
As humans, we navigate a multimodal world, building a holistic understanding from all our senses. We introduce MERLOT Reserve, a model that…
Towards General Purpose Vision Systems
A special purpose learning system assumes knowledge of admissible tasks at design time. Adapting such a system to unforeseen tasks requires architecture manipulation such as adding an output head…