Charades and Charades-Ego Datasets

These datasets guide our research into unstructured video activity recognition and commonsense reasoning for daily human activities.

Charades-Ego v1.0! Update, April 1st 2018
We are happy to announce that our new dataset has been released! Please refer to the new publications for details [*,*]. This work will be presented as a spotlight presentation at CVPR'18. Teaser Video:

Charades-Ego is a dataset composed of 7,860 videos of daily indoor activities collected through Amazon Mechanical Turk, recorded from both third-person and first-person viewpoints. The dataset contains 68,536 temporal annotations for 157 action classes.

Code Update! Update, Feb 5th 2018
We have added PyTorch baselines to

Dataset Update! Update, Dec 1st 2017
We have added additional attributes and code to generate the visualizations in "What Actions are Needed for Understanding Human Actions in Videos?"

Evaluation Server is still open for submissions. Update, Dec 1st 2017
The Charades Challenge evaluation server is still accepting new submissions. The 2017 challenge is over, but the leaderboard is still changing. This allows everyone to compare their method with the publicly available results of other algorithms on a held-out test set. Currently we allow 5 submissions per month, for each sufficiently unique algorithm. Do not hesitate to contact us if you have any questions. For more information visit

New analysis paper on activity recognition! Update, Sep 1st 2017
We have just released our work analyzing the state of activity recognition (arXiv). This paper will be presented at ICCV2017 in Venice.

Dataset Update! Update, May 15th 2017
We have added precomputed Two-Stream features using the available code and models; see the README for more details. More accurate scene annotations have also been added.

Charades Challenge at CVPR2017

Update, March 1st 2017
The Charades Challenge has two tracks: Video Classification and Activity Localization. The top team in each track will be invited to give an oral presentation, and all teams are encouraged to present their work in the poster session. There will also be monetary rewards for the top teams. The Charades Challenge is part of the CVPR 2017 Workshop on Visual Understanding Across Modalities. For more information:

Dataset Update (Localization)! Update, February 27th 2017
Charades has been updated to include action localization, RGB frames, Optical Flow frames, and more detailed object and verb annotations.

New paper on activity recognition! Update, December 19th 2016
We have just released our work demonstrating how to use the rich structure of the dataset (objects, actions, scenes, etc.) to get significant gains on activity recognition. This work also introduces the problem of localizing activities. (arXiv) (Update: This paper will be presented at CVPR2017 in Honolulu, Hawaii.)

Charades v1.0! Update, July 7th 2016
We are happy to announce that the dataset has been released! Each video has been exhaustively annotated using consensus from 4 workers on the training set, and from 8 workers on the test set. Please refer to the updated accompanying publication for details. Updated paper draft: PDF

Charades is a dataset composed of 9,848 videos of daily indoor activities collected through Amazon Mechanical Turk. 267 different users were each presented with a sentence that includes objects and actions from a fixed vocabulary, and they recorded a video acting out the sentence (as in a game of Charades). The dataset contains 66,500 temporal annotations for 157 action classes, 41,104 labels for 46 object classes, and 27,847 textual descriptions of the videos. This work was presented at ECCV2016.
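The temporal annotations ship as CSV, where each video's actions are listed as semicolon-separated `c<class> start end` triples (class id, start time in seconds, end time in seconds). A minimal parsing sketch, assuming that field format — check the shipped README for the exact column layout:

```python
# Hedged sketch: parse a Charades-style "actions" CSV field such as
# "c092 11.9 21.2;c147 0.0 12.6" into (class, start, end) tuples.
# The exact field name and layout are assumptions; the README is authoritative.

def parse_actions(field):
    """Parse an actions field into a list of (class_id, start_s, end_s)."""
    if not field:          # videos with no labeled actions have an empty field
        return []
    out = []
    for item in field.split(';'):
        cls, start, end = item.split(' ')
        out.append((cls, float(start), float(end)))
    return out
```

Each tuple can then be matched against the 157-class vocabulary for training or evaluation.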

Please contact us with questions about the dataset.

The Charades-Ego Dataset

  • README (Updated 4/1/18)
  • License
  • Annotations & Evaluation Code (2 MB) (Updated 4/1/18)
  • Data (scaled to 480p, 11 GB) (Updated 4/1/18)
  • Data (original size) (47 GB)
  • RGB frames at 24fps (53 GB)
  • Code @ GitHub

    The Charades Dataset

    Charades Classification performance

    • AlexNet: 11.2% mAP
    • C3D: 10.9% mAP
    • Two-Stream: 14.2% mAP
    • IDT: 17.2% mAP
    • Combined: 18.6% mAP
    • Asynchronous Temporal Fields: 22.4% mAP [*]
    • (Uses the official Charades_v1_classify.m evaluation code. More details may be found in the README and the papers below.)
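The classification metric is mean average precision over the 157 classes. The sketch below mirrors the usual multi-label mAP computation in Python; it is illustrative only — the official Charades_v1_classify.m script remains the authoritative evaluation.

```python
# Hedged sketch of multi-label classification mAP (illustrative;
# Charades_v1_classify.m is the official evaluation).
import numpy as np

def average_precision(scores, labels):
    """AP for one class: scores (n_videos,), labels in {0, 1}."""
    order = np.argsort(-scores)          # rank videos by descending confidence
    labels = labels[order]
    hits = np.cumsum(labels)             # true positives seen at each rank
    ranks = np.arange(1, len(labels) + 1)
    if not labels.any():                 # no positives for this class
        return 0.0
    return (hits[labels == 1] / ranks[labels == 1]).mean()

def mean_average_precision(score_matrix, label_matrix):
    """mAP over classes; both matrices are (n_videos, n_classes)."""
    aps = [average_precision(score_matrix[:, c], label_matrix[:, c])
           for c in range(score_matrix.shape[1])]
    return float(np.mean(aps))
```

With such a function, a (n_videos, 157) score matrix and the ground-truth label matrix are all that is needed to reproduce a classification number.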

    Charades Localization performance

    • Random: 2.42% mAP
    • VGG-16: 7.89% mAP
    • Two-Stream: 8.94% mAP
    • LSTM: 9.60% mAP
    • LSTM w/ post-processing: 10.4% mAP
    • Two-Stream w/ post-processing: 10.9% mAP
    • Asynchronous Temporal Fields: 12.8% mAP [*]
    • (Uses the official Charades_v1_localize.m evaluation code. More details may be found in the README and the papers below.)
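Localization is scored at the frame level: to keep evaluation tractable, a fixed number of equally spaced timepoints per video is scored against the ground-truth intervals. A sketch of building those per-frame labels, assuming 25 timepoints per video (the exact protocol is defined by Charades_v1_localize.m):

```python
# Hedged sketch: binary per-frame labels at equally spaced timepoints,
# for one action class. The 25-timepoint sampling is an assumption;
# Charades_v1_localize.m defines the official protocol.
import numpy as np

def frame_labels(intervals, video_length, num_frames=25):
    """intervals: list of (start, end) in seconds for one class."""
    times = np.linspace(0, video_length, num_frames)
    labels = np.zeros(num_frames, dtype=int)
    for start, end in intervals:
        labels[(times >= start) & (times <= end)] = 1
    return labels
```

Predicted scores at the same timepoints can then be evaluated with the same average-precision machinery as classification, treating each (timepoint, class) pair as an instance.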

    If this work helps your research, please consider citing the relevant publications:

      @inproceedings{sigurdsson2016hollywood,
        author = {Gunnar A. Sigurdsson and G{\"u}l Varol and Xiaolong Wang and Ali Farhadi and Ivan Laptev and Abhinav Gupta},
        title = {Hollywood in Homes: Crowdsourcing Data Collection for Activity Understanding},
        booktitle = {European Conference on Computer Vision},
        year = {2016},
        pdf = {},
        web = {}
      }

      @inproceedings{sigurdsson2018actor,
        author = {Gunnar A. Sigurdsson and Abhinav Gupta and Cordelia Schmid and Ali Farhadi and Karteek Alahari},
        title = {Actor and Observer: Joint Modeling of First and Third-Person Videos},
        booktitle = {IEEE Conference on Computer Vision and Pattern Recognition},
        year = {2018}
      }


    • Hollywood in Homes: Crowdsourcing Data Collection for Activity Understanding
      Gunnar A. Sigurdsson, Gül Varol, Xiaolong Wang, Ali Farhadi, Ivan Laptev, Abhinav Gupta

      Computer vision has a great potential to help our daily lives by searching for lost keys, watering flowers or reminding us to take a pill. To succeed with such tasks, computer vision methods need to be trained from real and diverse examples of our daily dynamic scenes. While most of such scenes are not particularly exciting, they typically do not appear on YouTube, in movies or TV broadcasts. So how do we collect sufficiently many diverse but boring samples representing our lives? We propose a novel Hollywood in Homes approach to collect such data. Instead of shooting videos in the lab, we ensure diversity by distributing and crowdsourcing the whole process of video creation from script writing to video recording and annotation. Following this procedure we collect a new dataset, Charades, with hundreds of people recording videos in their own homes, acting out casual everyday activities. The dataset is composed of 9,848 annotated videos with an average length of 30 seconds, showing activities of 267 people from three continents. Each video is annotated by multiple free-text descriptions, action labels, action intervals and classes of interacted objects. In total, Charades provides 27,847 video descriptions, 66,500 temporally localized intervals for 157 action classes and 41,104 labels for 46 object classes. Using this rich data, we evaluate and provide baseline results for several tasks including action recognition and automatic description generation. We believe that the realism, diversity, and casual nature of this dataset will present unique challenges and new opportunities for computer vision community.

    • Much Ado About Time: Exhaustive Annotation of Temporal Data
      Gunnar A. Sigurdsson, Olga Russakovsky, Ali Farhadi, Ivan Laptev, Abhinav Gupta

      Large-scale annotated datasets allow AI systems to learn from and build upon the knowledge of the crowd. Many crowdsourcing techniques have been developed for collecting image annotations. These techniques often implicitly rely on the fact that a new input image takes a negligible amount of time to perceive. In contrast, we investigate and determine the most cost-effective way of obtaining high-quality multi-label annotations for temporal data such as videos. Watching even a short 30-second video clip requires a significant time investment from a crowd worker; thus, requesting multiple annotations following a single viewing is an important cost-saving strategy. But how many questions should we ask per video? We conclude that the optimal strategy is to ask as many questions as possible in a HIT (up to 52 binary questions after watching a 30-second video clip in our experiments). We demonstrate that while workers may not correctly answer all questions, the cost-benefit analysis nevertheless favors consensus from multiple such cheap-yet-imperfect iterations over more complex alternatives. When compared with a one-question-per-video baseline, our method is able to achieve a 10% improvement in recall (76.7% ours versus 66.7% baseline) at comparable precision (83.8% ours versus 83.0% baseline) in about half the annotation time (3.8 minutes ours compared to 7.1 minutes baseline). We demonstrate the effectiveness of our method by collecting multi-label annotations of 157 human activities on 1,815 videos.

    • Asynchronous Temporal Fields for Action Recognition
      Gunnar A. Sigurdsson, Santosh Divvala, Ali Farhadi, Abhinav Gupta

      Actions are more than just movements and trajectories: we cook to eat and we hold a cup to drink from it. A thorough understanding of videos requires going beyond appearance modeling and necessitates reasoning about the sequence of activities, as well as the higher-level constructs such as intentions. But how do we model and reason about these? We propose a fully-connected temporal CRF model for reasoning over various aspects of activities that includes objects, actions, and intentions, where the potentials are predicted by a deep network. End-to-end training of such structured models is a challenging endeavor: For inference and learning we need to construct mini-batches consisting of whole videos, leading to mini-batches with only a few videos. This causes high-correlation between data points leading to breakdown of the backprop algorithm. To address this challenge, we present an asynchronous variational inference method that allows efficient end-to-end training. Our method achieves a classification mAP of 22.4% on the Charades benchmark, outperforming the state-of-the-art (17.2% mAP), and offers equal gains on the task of temporal localization.

    • What Actions are Needed for Understanding Human Actions in Videos?
      Gunnar A. Sigurdsson, Olga Russakovsky, Abhinav Gupta

      What is the right way to reason about human activities? What directions forward are most promising? In this work, we analyze the current state of human activity understanding in videos. The goal of this paper is to examine datasets, evaluation metrics, algorithms, and potential future directions. We look at the qualitative attributes that define activities such as pose variability, brevity, and density. The experiments consider multiple state-of-the-art algorithms and multiple datasets. The results demonstrate that while there is inherent ambiguity in the temporal extent of activities, current datasets still permit effective benchmarking. We discover that fine-grained understanding of objects and pose when combined with temporal reasoning is likely to yield substantial improvements in algorithmic accuracy. We present the many kinds of information that will be needed to achieve substantial gains in activity understanding: objects, verbs, intent, and sequential reasoning. The software and additional information will be made available to provide other researchers detailed diagnostics to understand their own algorithms.

    • Actor and Observer: Joint Modeling of First and Third-Person Videos
      Gunnar A. Sigurdsson, Abhinav Gupta, Cordelia Schmid, Ali Farhadi, Karteek Alahari

      Several theories in cognitive neuroscience suggest that when people interact with the world, or simulate interactions, they do so from a first-person egocentric perspective, and seamlessly transfer knowledge between third-person (observer) and first-person (actor). Despite this, learning such models for human action recognition has not been achievable due to the lack of data. This paper takes a step in this direction, with the introduction of Charades-Ego, a large-scale dataset of paired first-person and third-person videos, involving 112 people, with 4000 paired videos. This enables learning the link between the two, actor and observer perspectives. Thereby, we address one of the biggest bottlenecks facing egocentric vision research, providing a link from first-person to the abundant third-person data on the web. We use this data to learn a joint representation of first and third-person videos, with only weak supervision, and show its effectiveness for transferring knowledge from the third-person to the first-person domain.

    • Charades-Ego: A Large-Scale Dataset of Paired Third and First Person Videos
      Gunnar A. Sigurdsson, Abhinav Gupta, Cordelia Schmid, Ali Farhadi, Karteek Alahari

      In Actor and Observer we introduced a dataset linking the first- and third-person video understanding domains, the Charades-Ego Dataset. In this paper we describe the egocentric aspect of the dataset and present annotations for Charades-Ego with 68,536 activity instances in 68.8 hours of first- and third-person video, making it one of the largest and most diverse egocentric datasets available. Charades-Ego furthermore shares activity classes, scripts, and methodology with the Charades dataset, which consists of an additional 82.3 hours of third-person video with 66,500 activity instances. Charades-Ego has temporal annotations and textual descriptions, making it suitable for egocentric video classification, localization, captioning, and new tasks utilizing the cross-modal nature of the data.