Think you have solved question answering? Try the AI2 Reasoning Challenge (ARC)!
The ARC dataset consists of 7,787 science exam questions drawn from a variety of sources, including science questions provided under license by a research partner affiliated with AI2. These are text-only, English-language exam questions that span several grade levels, as indicated in the files. Each question has a multiple-choice structure (typically 4 answer options). The questions are sorted into a Challenge Set of 2,590 “hard” questions (those that both a retrieval method and a word co-occurrence method fail to answer correctly) and an Easy Set of 5,197 questions. Each set is pre-split into Train, Development, and Test sets as follows:

Set | Train | Development | Test | Total
---|---|---|---|---
Challenge | 1,119 | 299 | 1,172 | 2,590
Easy | 2,251 | 570 | 2,376 | 5,197
Each set is provided in two formats, CSV and JSONL. The CSV files contain the full text of each question and its answer options in a single cell. The JSONL files contain a split version of each question, where the question text has been programmatically separated from the answer options.
Please note: This data should not be distributed except by the Allen Institute for Artificial Intelligence (AI2). All parties interested in acquiring this data must download it from AI2 directly at allenai.org/data/arc. This data is to be used for non-commercial, research purposes only.
The JSONL files contain the same questions, split into the “stem” of the question (the question text) and the various answer “choices” with their corresponding labels (A, B, C, D). The question ID (“id”) and the correct answer (“answerKey”) are also included. For example:
{
  "id": "MCAS_2000_4_6",
  "question": {
    "stem": "Which technology was developed most recently?",
    "choices": [
      {
        "text": "cellular telephone",
        "label": "A"
      },
      {
        "text": "television",
        "label": "B"
      },
      {
        "text": "refrigerator",
        "label": "C"
      },
      {
        "text": "airplane",
        "label": "D"
      }
    ]
  },
  "answerKey": "A"
}
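These records can be loaded with a few lines of Python. The sketch below is a minimal example, not part of the official distribution; the filename is an assumption, so point it at your local copy of any of the JSONL files.

```python
import json

# Minimal sketch: read ARC questions from a JSONL file (one JSON object
# per line) and print each stem with its labeled answer choices.
path = "ARC-Challenge-Train.jsonl"  # assumed filename; adjust to your copy

with open(path, encoding="utf-8") as f:
    for line in f:
        record = json.loads(line)
        stem = record["question"]["stem"]
        answer = record.get("answerKey")  # e.g. "A"
        print(record["id"], stem)
        for choice in record["question"]["choices"]:
            marker = "*" if choice["label"] == answer else " "
            print(f"  {marker} {choice['label']}. {choice['text']}")
```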
Comma-delimited (CSV) columns:
The ARC Corpus contains 14M unordered, science-related sentences, including knowledge relevant to ARC, and is provided as a starting point for addressing the challenge. The corpus contains sentences from science-related documents downloaded from the Web, dictionary definitions from Wiktionary, and articles from Simple Wikipedia that were tagged as science. For details of its construction, see (Clark et al., 2018). Note that use of the corpus for the Challenge is completely optional, and systems are not restricted to this corpus. Please see the README included in the download for additional information and the corpus's terms of use.
As an example, consider the first 10 sentences in the corpus that mention both the words “gravity” and “force”.
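Such sentences are easy to pull out yourself. The sketch below assumes the corpus is a plain-text file with one sentence per line; the filename ARC_Corpus.txt is an assumption, so adjust it to your local copy.

```python
# Sketch: scan the ARC Corpus (assumed: one sentence per line) and print
# the first 10 sentences containing both "gravity" and "force".
path = "ARC_Corpus.txt"  # assumed filename; adjust to your copy
found = 0

with open(path, encoding="utf-8") as f:
    for sentence in f:
        lowered = sentence.lower()
        if "gravity" in lowered and "force" in lowered:
            print(sentence.strip())
            found += 1
            if found == 10:
                break
```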
We also provide neural baselines that we have run against ARC. The execution framework (also provided) is easily extensible to test new models on the ARC Question Set.
Details of the first three systems and their adaptation for the multiple choice setting are given in (Clark et al., 2018). The BiLSTM Max-out model is described in this README.
The models can be downloaded here.
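To illustrate what such a framework has to do, evaluating any model on ARC reduces to scoring each answer choice against the question stem, answering with the highest-scoring choice, and computing accuracy against the answer key. The sketch below is not the provided framework, just a minimal stand-in; the `score` function and the file path are placeholders for whatever model and data split you plug in.

```python
import json
import random

def score(stem: str, choice_text: str) -> float:
    """Placeholder scorer: swap in a real model that rates how well a
    choice answers the stem. Random scores are only a stand-in."""
    return random.random()

def evaluate(jsonl_path: str) -> float:
    """Answer each question with the highest-scoring choice and return
    accuracy against the gold answerKey."""
    correct = total = 0
    with open(jsonl_path, encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            stem = record["question"]["stem"]
            choices = record["question"]["choices"]
            best = max(choices, key=lambda c: score(stem, c["text"]))
            correct += best["label"] == record["answerKey"]
            total += 1
    return correct / total

# Example usage (path is an assumption):
# print(f"accuracy: {evaluate('ARC-Challenge-Test.jsonl'):.1%}")
```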
Rank | Submission | Created | Accuracy
---|---|---|---
1 | ST-MoE-32B (Google Brain) | 1/5/2022 | 87%
2 | UnifiedQA + ARC MC/DA + IR (Aristo team at Allen Institute for AI) | 1/19/2021 | 81%
3 | UnifiedQA - v2 (T5-11B) (Daniel Khashabi) | 10/30/2020 | 81%
4 | GenMC (Nanjing University: Zixian Huang, Ao Wu, Jiaying Zhou, Yu Gu, Yue Zhao, Gong Cheng) | 4/17/2022 | 80%
5 | ZeroQA (Pirtoaca George Sebastian, Polytechnic University of Bucharest) | 6/29/2020 | 79%