Papers

  • Towards Disturbance-Free Visual Mobile Manipulation

    Tianwei Ni, Kiana Ehsani, Luca Weihs, Jordi Salvador. arXiv, 2022. Deep reinforcement learning has shown promising results on an abundance of robotic tasks in simulation, including visual navigation and manipulation. Prior work generally aims to build embodied agents that solve their assigned tasks as quickly as possible…
  • Benchmarking Progress to Infant-Level Physical Reasoning in AI

    Luca Weihs, Amanda Rose Yuile, Renée Baillargeon, Cynthia Fisher, Gary Marcus, Roozbeh Mottaghi, Aniruddha Kembhavi. TMLR, 2022. To what extent do modern AI systems comprehend the physical world? We introduce the open-access Infant-Level Physical Reasoning Benchmark (InfLevel) to gain insight into this question. We evaluate ten neural-network architectures developed for video…
  • Simple but Effective: CLIP Embeddings for Embodied AI

    Apoorv Khandelwal, Luca Weihs, Roozbeh Mottaghi, Aniruddha Kembhavi. CVPR, 2022. Contrastive language image pretraining (CLIP) encoders have been shown to be beneficial for a range of visual tasks, from classification and detection to captioning and image manipulation. We investigate the effectiveness of CLIP visual backbones for Embodied…
  • MERLOT Reserve: Neural Script Knowledge through Vision and Language and Sound

    Rowan Zellers, Jiasen Lu, Ximing Lu, Youngjae Yu, Yanpeng Zhao, Mohammadreza Salehi, Aditya Kusupati, Jack Hessel, Ali Farhadi, Yejin Choi. CVPR, 2022. As humans, we navigate a multimodal world, building a holistic understanding from all our senses. We introduce MERLOT Reserve, a model that represents videos jointly over time – through a new training…
  • Towards General Purpose Vision Systems

    Tanmay Gupta, A. Kamath, Aniruddha Kembhavi, Derek Hoiem. CVPR, 2022. A special-purpose learning system assumes knowledge of admissible tasks at design time. Adapting such a system to unforeseen tasks requires architecture manipulation, such as adding an output head for each new task or dataset. In this work, we propose a task…
  • Unified-IO: A Unified Model for Vision, Language, and Multi-Modal Tasks

    Jiasen Lu, Christopher Clark, Rowan Zellers, Roozbeh Mottaghi, Aniruddha Kembhavi. arXiv, 2022. We propose Unified-IO, a model that performs a large variety of AI tasks spanning classical computer vision tasks, including pose estimation, object detection, depth estimation and image generation; vision-and-language tasks such as region captioning and…
  • What do navigation agents learn about their environment?

    Kshitij Dwivedi, G. Roig, Aniruddha Kembhavi, Roozbeh Mottaghi. arXiv, 2022. Today’s state-of-the-art visual navigation agents typically consist of large deep learning models trained end to end. Such models offer little to no interpretability about the learned skills or the actions the agent takes in response to its environment…
  • A-OKVQA: A Benchmark for Visual Question Answering using World Knowledge

    Dustin Schwenk, Apoorv Khandelwal, Christopher Clark, Kenneth Marino, Roozbeh Mottaghi. arXiv, 2022. The Visual Question Answering (VQA) task aspires to provide a meaningful testbed for the development of AI models that can jointly reason over visual and natural language inputs. Despite a proliferation of VQA datasets, this goal is hindered by a set of…
  • Continuous Scene Representations for Embodied AI

    S. Gadre, Kiana Ehsani, S. Song, Roozbeh Mottaghi. arXiv, 2022. We propose Continuous Scene Representations (CSR), a scene representation constructed by an embodied agent navigating within a space, where objects and their relationships are modeled by continuous-valued embeddings. Our method captures feature relationships…
  • Object Manipulation via Visual Target Localization

    Kiana Ehsani, Ali Farhadi, Aniruddha Kembhavi, Roozbeh Mottaghi. arXiv, 2022. Object manipulation is a critical skill required for Embodied AI agents interacting with the world around them. Training agents to manipulate objects poses many challenges. These include occlusion of the target object by the agent’s arm, noisy object…