AI2 Reasoning Challenge (ARC) 2018

Aristo • 2018
A new dataset of 7,787 genuine grade-school level, multiple-choice science questions, assembled to encourage research in advanced question-answering. The dataset is partitioned into a Challenge Set and an Easy Set, where the former contains only questions answered incorrectly by both a retrieval-based algorithm and a word co-occurrence algorithm. We are also including a corpus of over 14 million science sentences relevant to the task, and an implementation of three neural baseline models for this dataset. We pose ARC as a challenge to the community.

Think you have solved question answering? Try the AI2 Reasoning Challenge (ARC)!

The ARC dataset consists of 7,787 science exam questions drawn from a variety of sources, including science questions provided under license by a research partner affiliated with AI2. These are text-only, English language exam questions that span several grade levels as indicated in the files. Each question has a multiple choice structure (typically 4 answer options). The questions are sorted into a Challenge Set of 2,590 “hard” questions (those that both a retrieval and a co-occurrence method fail to answer correctly) and an Easy Set of 5,197 questions. Both sets are pre-split into Train, Development, and Test sets as follows:

  • Challenge Train: 1,119
  • Challenge Dev: 299
  • Challenge Test: 1,172
  • Easy Train: 2,251
  • Easy Dev: 570
  • Easy Test: 2,376
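The split sizes above can be sanity-checked against the set totals with a few lines of arithmetic; the numbers below are taken directly from this page.

```python
# Published ARC split sizes (from the list above).
challenge = {"Train": 1119, "Dev": 299, "Test": 1172}
easy = {"Train": 2251, "Dev": 570, "Test": 2376}

assert sum(challenge.values()) == 2590                        # Challenge Set total
assert sum(easy.values()) == 5197                             # Easy Set total
assert sum(challenge.values()) + sum(easy.values()) == 7787   # full ARC dataset
print("split sizes consistent")
```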

Each set is provided in two formats, CSV and JSONL. The CSV files contain the full text of the question and its answer options in one cell. The JSONL files contain a split version of the question, where the question text has been separated from the answer options programmatically.

Please note: This data should not be distributed except by the Allen Institute for Artificial Intelligence (AI2). All parties interested in acquiring this data must download it from AI2 directly at allenai.org/data/arc. This data is to be used for non-commercial, research purposes only.

JSONL Structure

The JSONL files contain the same questions split into the “stem” of the question (the question text) and then the various answer “choices” and their corresponding labels (A, B, C, D). The question’s unique id is also included.

{
  "id": "MCAS_2000_4_6",
  "question": {
    "stem": "Which technology was developed most recently?",
    "choices": [
      {
        "text": "cellular telephone",
        "label": "A"
      },
      {
        "text": "television",
        "label": "B"
      },
      {
        "text": "refrigerator",
        "label": "C"
      },
      {
        "text": "airplane",
        "label": "D"
      }
    ]
  },
  "answerKey": "A"
}
  • id - a unique identifier for the question (our own numbering)
  • question
    • stem - the question text
    • choices - the answer choices
      • label - the answer label (“A”, “B”, “C”, “D”)
      • text - the text associated with the answer label
  • answerKey - the correct answer option
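A file with this structure can be parsed one line at a time with the standard `json` module. The sketch below builds a small record per question and looks up the text of the correct answer; the `load_arc_jsonl` helper name is our own, and the sample line is the question shown above.

```python
import io
import json

def load_arc_jsonl(lines):
    """Parse ARC JSONL lines into dicts: id, stem, {label: text}, answerKey."""
    for line in lines:
        q = json.loads(line)
        yield {
            "id": q["id"],
            "stem": q["question"]["stem"],
            "choices": {c["label"]: c["text"] for c in q["question"]["choices"]},
            "answerKey": q["answerKey"],
        }

# The sample question from above, serialized as a single JSONL line.
sample = json.dumps({
    "id": "MCAS_2000_4_6",
    "question": {
        "stem": "Which technology was developed most recently?",
        "choices": [
            {"text": "cellular telephone", "label": "A"},
            {"text": "television", "label": "B"},
            {"text": "refrigerator", "label": "C"},
            {"text": "airplane", "label": "D"},
        ],
    },
    "answerKey": "A",
})

record = next(load_arc_jsonl(io.StringIO(sample)))
print(record["choices"][record["answerKey"]])  # → cellular telephone
```

In practice you would pass an open file handle (one question per line) instead of the `io.StringIO` stand-in used here.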

CSV Structure

Comma-delimited (CSV) columns:

  • questionID - a unique identifier for the question (our own numbering)
  • originalQuestionID - the question number on the test
  • totalPossiblePoint - how many points the question is worth when scoring
  • AnswerKey - the correct answer option
  • isMultipleChoiceQuestion - 1 = multiple choice, 0 = other
  • includesDiagram - 1 = includes diagram, 0 = other
  • examName - the source of the exam
  • schoolGrade - grade level
  • year - publication year of the exam
  • question - the text of the question itself
  • subject - the general question topic
  • category - Test, Train, or Dev
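Since the `category` column marks each row as Test, Train, or Dev, the CSV files can be read and filtered by split with `csv.DictReader`. This is a minimal sketch using the column names listed above; the two sample rows (and the subset of columns) are illustrative, not taken from the actual files.

```python
import csv
import io

# Illustrative stand-in for one ARC CSV file, using a subset of the real columns.
sample_csv = """questionID,AnswerKey,isMultipleChoiceQuestion,includesDiagram,schoolGrade,question,category
Q1,A,1,0,4,Which technology was developed most recently? (A) ...,Train
Q2,C,1,0,8,Which unit is used to measure force? (A) ...,Test
"""

def load_split(lines, split):
    """Return the rows belonging to one split: Test, Train, or Dev."""
    return [row for row in csv.DictReader(lines) if row["category"] == split]

train = load_split(io.StringIO(sample_csv), "Train")
print(len(train), train[0]["AnswerKey"])  # → 1 A
```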

ARC Corpus

The ARC Corpus contains 14M unordered, science-related sentences covering knowledge relevant to ARC, and is provided as a starting point for addressing the challenge. The Corpus contains sentences from three sources: science-related documents downloaded from the Web; dictionary definitions from Wiktionary; and articles from Simple Wikipedia that were tagged as science. For details of its construction, see (Clark et al., 2018). Note that use of the corpus for the Challenge is completely optional, and systems are not restricted to this corpus. Please see the README included in the download for additional information and terms of use of this corpus.

Sample sentences

As an example, here are the first 10 sentences mentioning both the words “gravity” and “force”:

  1. The force of gravity overcomes the nuclear forces which keep protons and neutrons from combining. Then the super cool part- children will use magnets to explore how gravity can easily be overcome by other forces- almost like defying gravity!
  2. On the basis of their observations and analysis, they attempt to discover and explain laws describing the forces of nature, such as gravity, electromagnetism, and nuclear interactions.
  3. In a gravity dam, the force that holds the dam in place against the push from the water is Earth’s gravity pulling down on the mass of the dam.
  4. Random motion of the air molecules and turbulence provide upward forces that may counteract the downward force of gravity.
  5. The earth’s gravity acts on air molecules to create a force, that of the air pushing on the earth.
  6. The pressure and frictional heat of tectonic forces cause rock to change (metamorphose)
  7. At subduction boundaries, gravity pulls plates back down into the mantle where rock liquefies into magma, completing the rock cycle.
  8. If an object falls from one point to another point inside a gravitational field, the force of gravity will do positive work on the object, and the gravitational potential energy will decrease by the same amount.
  9. The perpendicular component of the force of gravity is directed opposite the normal force and as such balances the normal force.
  10. Whereas previous ideas of motion depended on an outside force to instigate and maintain it (i.e. wind pushing a sail) Copernicus’s theories helped to inspire the concepts of gravity and inertia.
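A lookup like the one that produced the list above can be sketched as a simple filter over the corpus, which is stored one sentence per line. This is only one plausible matching scheme (case-insensitive whole-word match); the three-sentence corpus below stands in for the real 14M-sentence file.

```python
import re

# Stand-in for the one-sentence-per-line ARC Corpus (sentences taken
# from the examples above).
corpus = [
    "Random motion of the air molecules and turbulence provide upward "
    "forces that may counteract the downward force of gravity.",
    "The pressure and frictional heat of tectonic forces cause rock to change.",
    "In a gravity dam, the force that holds the dam in place against the "
    "push from the water is Earth's gravity pulling down on the mass of the dam.",
]

def mentions_all(sentence, words):
    """Case-insensitive whole-word match for every query word."""
    return all(
        re.search(rf"\b{re.escape(w)}\b", sentence, re.IGNORECASE) for w in words
    )

hits = [s for s in corpus if mentions_all(s, ["gravity", "force"])]
print(len(hits))  # → 2 (the second sentence mentions "forces" but not "gravity")
```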

ARC Baselines

We also provide neural baselines that we have run against ARC. The execution framework (also provided) is easily extensible to test new models on the ARC Question Set.

  • DecompAttn, based on the Decomposable Attention model of Parikh et al. (2016), a top performer on the SNLI dataset.
  • BiDAF, based on the Bidirectional Attention Flow model of Seo et al. (2017), a top performer on the SQuAD dataset.
  • DGEM, based on the Decomposable Graph Entailment Model of Khot et al. (2018), a top performer on the SciTail dataset.
  • Knowledge-free BiLSTM Max-out model with max-attention from question to choices, adapted by Mihaylov et al. from the model of Conneau et al. (2017).

Details of the first three systems and their adaptation for the multiple choice setting are given in (Clark et al. 2018). The BiLSTM Max-out model is described in this README.
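All of these baselines are scored the same way the leaderboard is: one point per question whose predicted label matches the answerKey, divided by the number of questions. A minimal sketch of that metric (with illustrative labels, not real model output):

```python
def accuracy(predictions, gold):
    """predictions, gold: dicts mapping question id -> answer label."""
    correct = sum(predictions.get(qid) == label for qid, label in gold.items())
    return correct / len(gold)

gold = {"q1": "A", "q2": "C", "q3": "B", "q4": "D"}
preds = {"q1": "A", "q2": "B", "q3": "B", "q4": "D"}  # one wrong answer
print(accuracy(preds, gold))  # → 0.75
```

Note that some ARC questions have more or fewer than four options, so predictions should be made over whatever labels each question actually provides.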

The models can be downloaded here.

Leaderboard

Top Public Submissions (submission, submitter; accuracy, created date):

  • 1. ZeroQA, Pirtoaca George Sebastian (Polytechnic University of Bucharest): 79% (6/29/2020)
  • 2. UnifiedQA (finetuned on ARC) - with IR, Daniel Khashabi (AI2): 78% (4/24/2020)
  • 3. FreeLB-RoBERTa (single model), Microsoft Dynamics 365 AI Research & UMD: 68% (9/27/2019)
  • 4. (submission name not listed): 67% (10/31/2019)
  • 5. xlnet + roberta (ensemble), erenup (https://github.com/erenup): 67% (8/30/2019)