Research - Papers
Explore a selection of our published work on a variety of key research challenges in AI.
Towards General Purpose Vision Systems
A special purpose learning system assumes knowledge of admissible tasks at design time. Adapting such a system to unforeseen tasks requires architecture manipulation such as adding an output head…
Unified-IO: A Unified Model for Vision, Language, and Multi-Modal Tasks
We propose Unified-IO, a model that performs a large variety of AI tasks spanning classical computer vision tasks, including pose estimation, object detection, depth estimation and image generation,…
What do navigation agents learn about their environment?
Today’s state of the art visual navigation agents typically consist of large deep learning models trained end to end. Such models offer little to no interpretability about the learned skills or the…
A-OKVQA: A Benchmark for Visual Question Answering using World Knowledge
The Visual Question Answering (VQA) task aspires to provide a meaningful testbed for the development of AI models that can jointly reason over visual and natural language inputs. Despite a…
Continuous Scene Representations for Embodied AI
We propose Continuous Scene Representations (CSR), a scene representation constructed by an embodied agent navigating within a space, where objects and their relationships are modeled by continuous…
Object Manipulation via Visual Target Localization
Object manipulation is a critical skill required for Embodied AI agents interacting with the world around them. Training agents to manipulate objects, poses many challenges. These include occlusion…
Interactron: Embodied Adaptive Object Detection
Over the years various methods have been proposed for the problem of object detection. Recently, we have wit-nessed great strides in this domain owing to the emergence of powerful deep neural…
Multi-Modal Answer Validation for Knowledge-Based VQA
The problem of knowledge-based visual question answering involves answering questions that require external knowledge in addition to the content of the image. Such knowledge typically comes in a…
Vessel Detection in Sentinel-1 Imagery
In this document, we detail the approach in our xView3 submission. The xView3 dataset presents the challenge of detecting vessels and other maritime objects in synthetic aperture radar (SAR) images…
Bridging the Imitation Gap by Adaptive Insubordination
Why do agents often obtain better reinforcement learning policies when imitating a worse expert? We show that privileged information used by the expert is marginalized in the learned agent policy,…