Charades Dataset

This dataset guides our research into unstructured video activity recognition and commonsense reasoning for daily human activities.

New paper on activity recognition! Update, December 19th 2016
We have just released our work on demonstrating how to use the rich structure of the dataset (objects, actions, scenes, etc) to get significant gains on activity recognition. This work also introduces the problem of localizating activities. (arXiv)

Charades v1.0! Update, July 7th 2016
We are happy to announce that the dataset has been released! Each video has been exhaustively annotated using consensus from 4 workers on the training set, and from 8 workers on the test set. Please refer to the updated accompanying publication for details. Updated paper draft: PDF

Charades is dataset composed of 9848 videos of daily indoors activities collected through Amazon Mechanical Turk. 267 different users were presented with a sentence, that includes objects and actions from a fixed vocabulary, and they recorded a video acting out the sentence (like in a game of Charades). The dataset contains 66,500 temporal annotations for 157 action classes, 41,104 labels for 46 object classes, and 27,847 textual descriptions of the videos. This work will be presented at ECCV2016.

Please contact for questions about the dataset.


Classification performance

  • AlexNet: 11.2% mAP
  • C3D: 10.9% mAP
  • Two-Stream: 14.2% mAP
  • IDT: 17.2% mAP
  • Combined: 18.6% mAP
  • Connected Temporal Fields: 21.9% mAP
  • More details may be found in the papers below.

If this work helps your research, please cite:

  author = {Gunnar A. Sigurdsson and G{\"u}l Varol and Xiaolong Wang and Ali Farhadi and Ivan Laptev and Abhinav Gupta},
  title={Hollywood in Homes: Crowdsourcing Data Collection for Activity Understanding},
  booktitle={European Conference on Computer Vision},
  pdf = {},
  web = {}


  • Hollywood in Homes: Crowdsourcing Data Collection for Activity Understanding
    Gunnar A. Sigurdsson, Gül Varol, Xiaolong Wang, Ali Farhadi, Ivan Laptev, Abhinav Gupta

    Computer vision has a great potential to help our daily lives by searching for lost keys, watering flowers or reminding us to take a pill. To succeed with such tasks, computer vision methods need to be trained from real and diverse examples of our daily dynamic scenes. While most of such scenes are not particularly exciting, they typically do not appear on YouTube, in movies or TV broadcasts. So how do we collect sufficiently many diverse but boring samples representing our lives? We propose a novel Hollywood in Homes approach to collect such data. Instead of shooting videos in the lab, we ensure diversity by distributing and crowdsourcing the whole process of video creation from script writing to video recording and annotation. Following this procedure we collect a new dataset, Charades, with hundreds of people recording videos in their own homes, acting out casual everyday activities. The dataset is composed of 9,848 annotated videos with an average length of 30 seconds, showing activities of 267 people from three continents. Each video is annotated by multiple free-text descriptions, action labels, action intervals and classes of interacted objects. In total, Charades provides 27,847 video descriptions, 66,500 temporally localized intervals for 157 action classes and 41,104 labels for 46 object classes. Using this rich data, we evaluate and provide baseline results for several tasks including action recognition and automatic description generation. We believe that the realism, diversity, and casual nature of this dataset will present unique challenges and new opportunities for computer vision community. Less

  • Much Ado About Time: Exhaustive Annotation of Temporal Data
    Gunnar A. Sigurdsson, Olga Russakovsky, Ali Farhadi, Ivan Laptev, Abhinav Gupta

    Large-scale annotated datasets allow AI systems to learn from and build upon the knowledge of the crowd. Many crowdsourcing techniques have been developed for collecting image annotations. These techniques often implicitly rely on the fact that a new input image takes a negligible amount of time to perceive. In contrast, we investigate and determine the most cost-effective way of obtaining high-quality multi-label annotations for temporal data such as videos. Watching even a short 30-second video clip requires a significant time investment from a crowd worker; thus, requesting multiple annotations following a single viewing is an important cost-saving strategy. But how many questions should we ask per video? We conclude that the optimal strategy is to ask as many questions as possible in a HIT (up to 52 binary questions after watching a 30-second video clip in our experiments). We demonstrate that while workers may not correctly answer all questions, the cost-benefit analysis nevertheless favors consensus from multiple such cheap-yet-imperfect iterations over more complex alternatives. When compared with a one-question-per-video baseline, our method is able to achieve a 10% improvement in recall (76.7% ours versus 66.7% baseline) at comparable precision (83.8% ours versus 83.0% baseline) in about half the annotation time (3.8 minutes ours compared to 7.1 minutes baseline). We demonstrate the effectiveness of our method by collecting multi-label annotations of 157 human activities on 1,815 videos. Less

  • Asynchronous Temporal Fields for Action Recognition
    Gunnar A. Sigurdsson, Santosh Divvala, Ali Farhadi, Abhinav Gupta

    Actions are more than just movements and trajectories: we cook to eat and we hold a cup to drink from it. A thorough understanding of videos requires going beyond appearance modeling and necessitates reasoning about the sequence of activities, as well as the higher-level constructs such as intentions. But how do we model and reason about these? We propose a fully-connected temporal CRF model for reasoning over various aspects of activities that includes objects, actions, and intentions, where the potentials are predicted by a deep network. End-to-end training of such structured models is a challenging endeavor: For inference and learning we need to construct mini-batches consisting of whole videos, leading to mini-batches with only a few videos. This causes high-correlation between data points leading to breakdown of the backprop algorithm. To address this challenge, we present an asynchronous variational inference method that allows efficient end-to-end training. Our method achieves a classification mAP of 21.9% on the Charades benchmark, outperforming the state-of-the-art (17.2% mAP), and offers equal gains on the task of temporal localization. Less