
Charades Dataset

This dataset guides our research into unstructured video activity recognition and commonsense reasoning for daily human activities.

New analysis paper on activity recognition! Update, Sep 1st 2017
We have just released our work on analyzing the state of activity recognition (arXiv). This paper will be presented at ICCV2017 in Venice.

Dataset Update! Update, May 15th 2017
We have added precomputed Two-Stream features using the available code and models at github.com/gsig/charades-algorithms; see the README for more details. More accurate scene annotations were also added.

Charades Challenge at CVPR2017! Update, March 1st 2017
The Charades Challenge has two tracks: Video Classification and Activity Localization. The top team in each track will be invited to give an oral presentation, and all teams are encouraged to present their work in the poster session. There will also be monetary rewards for the top teams. The Charades Challenge is a part of the CVPR 2017 Workshop on Visual Understanding Across Modalities. For more information: vuchallenge.org/charades.html

Dataset Update (Localization)! Update, February 27th 2017
Charades has been updated to include action localization, RGB frames, Optical Flow frames, and more detailed object and verb annotations.

New paper on activity recognition! Update, December 19th 2016
We have just released our work demonstrating how to use the rich structure of the dataset (objects, actions, scenes, etc.) to get significant gains on activity recognition. This work also introduces the problem of localizing activities (arXiv). (Update: This paper will be presented at CVPR2017 in Honolulu, Hawaii.)

Charades v1.0! Update, July 7th 2016
We are happy to announce that the dataset has been released! Each video has been exhaustively annotated using consensus from 4 workers on the training set, and from 8 workers on the test set. Please refer to the updated accompanying publication for details. Updated paper draft: PDF

Charades is a dataset of 9,848 videos of daily indoor activities collected through Amazon Mechanical Turk. 267 different users were presented with a sentence that includes objects and actions from a fixed vocabulary, and they recorded a video acting out the sentence (as in a game of Charades). The dataset contains 66,500 temporal annotations for 157 action classes, 41,104 labels for 46 object classes, and 27,847 textual descriptions of the videos. This work was presented at ECCV2016.
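
For reference, the following is a minimal sketch of how the temporal action annotations can be read in Python. It assumes the CSV layout described in the annotation README: one row per video, with an actions column holding semicolon-separated "class start end" triples (e.g. c092 11.9 21.2). The file name Charades_v1_train.csv and the column names are assumptions taken from that README, so check them against the copy shipped with the download.

import csv

def parse_actions(actions_field):
    """Parse a Charades 'actions' field, e.g. 'c092 11.9 21.2;c147 0.0 12.6',
    into a list of (class_id, start_seconds, end_seconds) tuples."""
    if not actions_field:
        return []  # some videos have no labeled actions
    intervals = []
    for chunk in actions_field.split(';'):
        cls, start, end = chunk.split()
        intervals.append((cls, float(start), float(end)))
    return intervals

# Assumed file name from the annotation release; adjust to your local path.
with open('Charades_v1_train.csv', newline='') as f:
    for row in csv.DictReader(f):
        for cls, start, end in parse_actions(row['actions']):
            print(row['id'], cls, start, end)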

Please contact vision.amt@allenai.org for questions about the dataset.

Dataset

Classification performance

  • AlexNet: 11.2% mAP
  • C3D: 10.9% mAP
  • Two-Stream: 14.2% mAP
  • IDT: 17.2% mAP
  • Combined: 18.6% mAP
  • Asynchronous Temporal Fields: 22.4% mAP [*]
  • (Uses the official Charades_v1_classify.m evaluation code. More details may be found in the README and the papers below.)
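
For orientation, below is a minimal NumPy sketch of the video-level multi-label mAP used for these numbers. It is not a substitute for the official Charades_v1_classify.m, which remains the authoritative evaluation; tie-breaking and classes without positives may be handled differently there.

import numpy as np

def average_precision(scores, labels):
    """Average precision for one class.
    scores: (num_videos,) predicted confidences; labels: (num_videos,) 0/1 ground truth (NumPy arrays)."""
    order = np.argsort(-scores)            # rank videos by descending confidence
    labels = labels[order]
    cum_tp = np.cumsum(labels)
    precision = cum_tp / np.arange(1, len(labels) + 1)
    num_pos = labels.sum()
    if num_pos == 0:
        return np.nan                      # class has no positives in this split
    return float((precision * labels).sum() / num_pos)

def mean_average_precision(scores, labels):
    """mAP over all classes; both arrays have shape (num_videos, num_classes)."""
    aps = [average_precision(scores[:, c], labels[:, c]) for c in range(scores.shape[1])]
    return float(np.nanmean(aps))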

Localization performance

  • Random: 2.42% mAP
  • VGG-16: 7.89% mAP
  • Two-Stream: 8.94% mAP
  • LSTM: 9.60% mAP
  • LSTM w/ post-processing: 10.4% mAP
  • Two-Stream w/ post-processing: 10.9% mAP
  • Asynchronous Temporal Fields: 12.8% mAP [*]
  • (Uses the official Charades_v1_localize.m evaluation code. More details may be found in the README and the papers below.)
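
The localization metric scores per-class confidences at a fixed set of equally spaced timestamps in each video (25 in the official setup described in the evaluation README). As an illustration only, the sketch below rasterizes the temporal annotations onto such a grid so that frame-wise mAP can be computed with the same average-precision code as above; the exact placement of the timestamps is an assumption, so defer to Charades_v1_localize.m for reported numbers.

import numpy as np

NUM_CLASSES = 157  # Charades action classes, c000 through c156

def frame_label_matrix(intervals, video_length, num_frames=25):
    """Binary (num_frames, NUM_CLASSES) matrix marking which actions are active
    at each of num_frames equally spaced timestamps.
    intervals: list of (class_index, start_seconds, end_seconds); class ids such as
    'c092' can be mapped to integer indices with int(cls[1:])."""
    timestamps = np.linspace(0.0, video_length, num_frames)
    labels = np.zeros((num_frames, NUM_CLASSES), dtype=bool)
    for cls, start, end in intervals:
        labels[(timestamps >= start) & (timestamps <= end), cls] = True
    return labels

# Frame-wise mAP can then reuse mean_average_precision() from the classification
# sketch by stacking the per-frame rows of every test video into one matrix.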

If this work helps your research, please cite:

@inproceedings{sigurdsson2016hollywood,
  author    = {Gunnar A. Sigurdsson and G{\"u}l Varol and Xiaolong Wang and Ali Farhadi and Ivan Laptev and Abhinav Gupta},
  title     = {Hollywood in Homes: Crowdsourcing Data Collection for Activity Understanding},
  booktitle = {European Conference on Computer Vision},
  year      = {2016},
  pdf       = {http://arxiv.org/pdf/1604.01753.pdf},
  web       = {http://allenai.org/plato/charades/}
}

Papers

  • Hollywood in Homes: Crowdsourcing Data Collection for Activity Understanding
    Gunnar A. Sigurdsson, Gül Varol, Xiaolong Wang, Ali Farhadi, Ivan Laptev, Abhinav Gupta

    Computer vision has a great potential to help our daily lives by searching for lost keys, watering flowers or reminding us to take a pill. To succeed with such tasks, computer vision methods need to be trained from real and diverse examples of our daily dynamic scenes. While most of such scenes are not particularly exciting, they typically do not appear on YouTube, in movies or TV broadcasts. So how do we collect sufficiently many diverse but boring samples representing our lives? We propose a novel Hollywood in Homes approach to collect such data. Instead of shooting videos in the lab, we ensure diversity by distributing and crowdsourcing the whole process of video creation from script writing to video recording and annotation. Following this procedure we collect a new dataset, Charades, with hundreds of people recording videos in their own homes, acting out casual everyday activities. The dataset is composed of 9,848 annotated videos with an average length of 30 seconds, showing activities of 267 people from three continents. Each video is annotated by multiple free-text descriptions, action labels, action intervals and classes of interacted objects. In total, Charades provides 27,847 video descriptions, 66,500 temporally localized intervals for 157 action classes and 41,104 labels for 46 object classes. Using this rich data, we evaluate and provide baseline results for several tasks including action recognition and automatic description generation. We believe that the realism, diversity, and casual nature of this dataset will present unique challenges and new opportunities for the computer vision community.

  • Much Ado About Time: Exhaustive Annotation of Temporal Data
    Gunnar A. Sigurdsson, Olga Russakovsky, Ali Farhadi, Ivan Laptev, Abhinav Gupta

    Large-scale annotated datasets allow AI systems to learn from and build upon the knowledge of the crowd. Many crowdsourcing techniques have been developed for collecting image annotations. These techniques often implicitly rely on the fact that a new input image takes a negligible amount of time to perceive. In contrast, we investigate and determine the most cost-effective way of obtaining high-quality multi-label annotations for temporal data such as videos. Watching even a short 30-second video clip requires a significant time investment from a crowd worker; thus, requesting multiple annotations following a single viewing is an important cost-saving strategy. But how many questions should we ask per video? We conclude that the optimal strategy is to ask as many questions as possible in a HIT (up to 52 binary questions after watching a 30-second video clip in our experiments). We demonstrate that while workers may not correctly answer all questions, the cost-benefit analysis nevertheless favors consensus from multiple such cheap-yet-imperfect iterations over more complex alternatives. When compared with a one-question-per-video baseline, our method is able to achieve a 10% improvement in recall (76.7% ours versus 66.7% baseline) at comparable precision (83.8% ours versus 83.0% baseline) in about half the annotation time (3.8 minutes ours compared to 7.1 minutes baseline). We demonstrate the effectiveness of our method by collecting multi-label annotations of 157 human activities on 1,815 videos.

  • Asynchronous Temporal Fields for Action Recognition
    Gunnar A. Sigurdsson, Santosh Divvala, Ali Farhadi, Abhinav Gupta

    Actions are more than just movements and trajectories: we cook to eat and we hold a cup to drink from it. A thorough understanding of videos requires going beyond appearance modeling and necessitates reasoning about the sequence of activities, as well as the higher-level constructs such as intentions. But how do we model and reason about these? We propose a fully-connected temporal CRF model for reasoning over various aspects of activities that includes objects, actions, and intentions, where the potentials are predicted by a deep network. End-to-end training of such structured models is a challenging endeavor: For inference and learning we need to construct mini-batches consisting of whole videos, leading to mini-batches with only a few videos. This causes high-correlation between data points leading to breakdown of the backprop algorithm. To address this challenge, we present an asynchronous variational inference method that allows efficient end-to-end training. Our method achieves a classification mAP of 22.4% on the Charades benchmark, outperforming the state-of-the-art (17.2% mAP), and offers equal gains on the task of temporal localization.

  • What Actions are Needed for Understanding Human Actions in Videos?
    Gunnar A. Sigurdsson, Olga Russakovsky, Abhinav Gupta

    What is the right way to reason about human activities? What directions forward are most promising? In this work, we analyze the current state of human activity understanding in videos. The goal of this paper is to examine datasets, evaluation metrics, algorithms, and potential future directions. We look at the qualitative attributes that define activities such as pose variability, brevity, and density. The experiments consider multiple state-of-the-art algorithms and multiple datasets. The results demonstrate that while there is inherent ambiguity in the temporal extent of activities, current datasets still permit effective benchmarking. We discover that fine-grained understanding of objects and pose when combined with temporal reasoning is likely to yield substantial improvements in algorithmic accuracy. We present the many kinds of information that will be needed to achieve substantial gains in activity understanding: objects, verbs, intent, and sequential reasoning. The software and additional information will be made available to provide other researchers detailed diagnostics to understand their own algorithms.

Video