StrategyQA

AI2 Israel, Question Understanding, Aristo • 2021
StrategyQA is a question-answering benchmark focusing on open-domain questions where the required reasoning steps are implicit in the question and should be inferred using a strategy. StrategyQA includes 2,780 examples, each consisting of a strategy question, its decomposition, and evidence paragraphs.

A Question Answering Benchmark with Implicit Reasoning Strategies

The StrategyQA dataset was created through a crowdsourcing pipeline for eliciting creative and diverse yes/no questions that require implicit reasoning steps. To solve questions in StrategyQA, the reasoning steps should be inferred using a strategy. To guide and evaluate the question answering process, each example in StrategyQA was annotated with a decomposition into reasoning steps for answering it, and Wikipedia paragraphs that provide evidence for the answer to each step.

As illustrated in the figure below, questions in StrategyQA (Q1) require implicit reasoning, in contrast to multi-step questions that explicitly specify the reasoning process (Q2). Each training example contains a question (Q1), a yes/no answer (A), a decomposition (D), and evidence paragraphs (E).

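[Figure: a strategy question (Q1) contrasted with an explicit multi-step question (Q2), with its answer (A), decomposition (D), and evidence paragraphs (E)]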

For more details on StrategyQA, please refer to our paper: “Did Aristotle Use a Laptop? A Question Answering Benchmark with Implicit Reasoning Strategies”, accepted for publication in Transactions of the Association for Computational Linguistics (TACL), 2021.

Explore

StrategyQA includes 2,780 questions, covering various knowledge domains. Below are a few of them:

Example 1

“Is growing seedless cucumber good for a gardener with entomophobia?”

Answer: Yes
Explanation: Seedless cucumber fruit does not require pollination. Cucumber plants need insects to pollinate them. Entomophobia is a fear of insects.

Example 2

“Are chinchillas cold-blooded?”

Answer: No
Explanation: Chinchillas are rodents, which are mammals. All mammals are warm-blooded.

Example 3

“Would Janet Jackson avoid a dish with ham?”

Answer: Yes
Explanation: Janet Jackson follows an Islamic practice. Islamic culture avoids eating pork. Ham is made from pork.

Example 4

“Can a honey bee sting a human more than once?”

Answer: No
Explanation: Human skin is tough, and the bee’s stinger gets lodged in the skin. The stinger becomes separated from the bee which dies soon after.

Distribution of question topics in StrategyQA (“taxon” stands for a group of organisms):

[Figure: distribution of question topics in StrategyQA]

Dataset

The StrategyQA dataset includes the following files:

  • strategyqa_train.json: The training set of StrategyQA, which includes 2,290 examples.

  • strategyqa_train_paragraphs.json: Paragraphs from our corpus that were matched as evidence for examples in the training set.

  • strategyqa_train_filtered.json: 2,821 additional questions, excluded from the official training set, that were filtered by our solvers during data collection (see more details in the paper).

  • strategyqa_test.json: The test set of StrategyQA, which includes 490 examples.

In addition, the full corpus of 36.6M processed paragraphs from Wikipedia can be downloaded from here. Note that the file size is 5.8GB compressed and 19GB uncompressed.

Data Format

Examples in the datasets are stored in the following format:

  • qid: Question ID.
  • term: The Wikipedia term used to prime the question writer.
  • description: A short description of the term, extracted from Wikipedia.
  • question: A strategy question.
  • answer: A boolean answer to the question (True/False for “Yes”/“No”).
  • facts: (Noisy) facts provided by the question writer in order to guide the following annotation tasks (see more details in the paper).
  • decomposition: A sequence (list) of single-step questions that form a reasoning process for answering the question. References to answers to previous steps are marked with “#”. Further explanations can be found in the paper.
  • evidence: A list of 3 annotations; each annotation contains the matched evidence for every decomposition step. Evidence for a decomposition step is a list with paragraph IDs and, potentially, the reserved tags no_evidence and operation.
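
The "#" references in the decomposition field can be filled in mechanically once the answers to earlier steps are known. The following is a minimal sketch, assuming a reference is written as "#" followed by a 1-based step number (as in "Is #1 less than #2?"). The helper name fill_references and the intermediate answers are hypothetical and are not part of the released dataset tooling.

```python
import re

# Assumption: a reference is "#" followed by a 1-based step index, e.g. "#1".
_REFERENCE = re.compile(r"#(\d+)")


def fill_references(step, previous_answers):
    """Replace each "#k" in `step` with the answer to step k (1-based)."""

    def _substitute(match):
        index = int(match.group(1)) - 1
        if not 0 <= index < len(previous_answers):
            raise ValueError(f"no answer available for step #{index + 1}")
        return str(previous_answers[index])

    return _REFERENCE.sub(_substitute, step)


decomposition = [
    "What is the maximum depth of the Black Sea?",
    "How deep can sunlight penetrate a sea?",
    "Is #1 less than #2?",
]
# Hypothetical answers to the first two steps, for illustration only.
answers = ["2,212 meters", "1000 meters"]

resolved = fill_references(decomposition[2], answers)
print(resolved)
```

A step that refers to a not-yet-answered step raises a ValueError rather than being left partially filled.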

The file strategyqa_train_filtered.json does not include annotations of facts, decomposition, and evidence, and the public test examples in strategyqa_test.json include only the fields qid and question.
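
Because the three example files carry different subsets of the fields, a loader may want to treat every field except qid and question as optional. The sketch below shows one way to do that. It assumes each file holds a single JSON list of example objects, and the names StrategyQAExample, parse_examples, and load_examples are illustrative rather than part of any official loader.

```python
import json
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class StrategyQAExample:
    # Present in every file, including the public test set.
    qid: str
    question: str
    # Absent from the public test set.
    term: Optional[str] = None
    description: Optional[str] = None
    answer: Optional[bool] = None
    # Absent from the filtered file and from the public test set.
    facts: List[str] = field(default_factory=list)
    decomposition: List[str] = field(default_factory=list)
    evidence: list = field(default_factory=list)


def parse_examples(records):
    """Build examples from decoded JSON records, tolerating absent fields."""
    return [
        StrategyQAExample(
            qid=record["qid"],
            question=record["question"],
            term=record.get("term"),
            description=record.get("description"),
            answer=record.get("answer"),
            facts=record.get("facts", []),
            decomposition=record.get("decomposition", []),
            evidence=record.get("evidence", []),
        )
        for record in records
    ]


def load_examples(path):
    # Assumption: the file contains one JSON list of example objects.
    with open(path, encoding="utf-8") as f:
        return parse_examples(json.load(f))


# A record shaped like a public test example; the qid "0" is made up.
test_like = parse_examples(
    [{"qid": "0", "question": "Are chinchillas cold-blooded?"}]
)[0]
print(test_like.answer, test_like.decomposition)
```

With this shape, load_examples("strategyqa_test.json") would yield examples whose answer is None, which keeps unlabeled questions distinguishable from questions answered "No".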

Evidence paragraphs in strategyqa_train_paragraphs.json are stored in a dictionary that maps paragraph IDs (e.g. Winter clothing-1) to paragraph metadata and content (e.g. “Winter clothing are clothes used for protection against the particularly cold weather…”). The fields for every paragraph are:

  • title: The title of the page the paragraph was extracted from.
  • section: The title of the section the paragraph was extracted from.
  • headers: Any headers under which the paragraph is located in the page.
  • para_index: The index of the paragraph in the Wikipedia page.
  • content: The textual content of the paragraph, with macros and links removed.
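
Given the dictionary layout described above, resolving paragraph IDs to their text is a plain dictionary lookup. The sketch below uses a hand-written, one-entry stand-in for the contents of strategyqa_train_paragraphs.json (with the content shortened to its first sentence). The function name lookup_paragraphs is hypothetical.

```python
def lookup_paragraphs(paragraphs, paragraph_ids):
    """Return (id, title, content) for every requested ID found in `paragraphs`."""
    found = []
    for paragraph_id in paragraph_ids:
        entry = paragraphs.get(paragraph_id)
        if entry is None:
            # Skip IDs that are not in the dictionary instead of failing.
            continue
        found.append((paragraph_id, entry["title"], entry["content"]))
    return found


# Stand-in for the decoded paragraphs file; content shortened for brevity.
paragraphs = {
    "Winter clothing-1": {
        "title": "Winter clothing",
        "section": "",
        "headers": [""],
        "para_index": 1,
        "content": "Winter clothing are clothes used for protection "
                   "against the particularly cold weather of winter.",
    }
}

hits = lookup_paragraphs(paragraphs, ["Winter clothing-1", "Missing page-7"])
for paragraph_id, title, content in hits:
    print(paragraph_id, "|", title, "|", content)
```

Unknown IDs are skipped silently here; a stricter caller may prefer to raise an error so that mismatches between examples and the paragraphs file are noticed early.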

Each line in the corpus file corpus-enwiki-20200511-cirrussearch-parasv2.jsonl.gz contains a paragraph with similar metadata fields to strategyqa_train_paragraphs.json. There are several additional metadata fields for indexing the paragraphs in ElasticSearch. The script for creating an ElasticSearch index will be provided soon.
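
Since the corpus is large (19GB uncompressed), it is usually preferable to stream it line by line instead of loading it into memory. The following is a minimal sketch of that pattern using only the Python standard library. It assumes each non-empty line is one JSON object, and it is demonstrated on a tiny temporary stand-in file rather than on the real corpus. The function name iter_corpus and the demo records are illustrative.

```python
import gzip
import json
import os
import tempfile


def iter_corpus(path):
    """Yield one paragraph record per non-empty line of a gzipped JSON Lines file."""
    with gzip.open(path, mode="rt", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)


# Demo on a tiny stand-in file; these records are made up for illustration.
records = [
    {"title": "Winter clothing", "para_index": 1, "content": "first paragraph"},
    {"title": "Winter clothing", "para_index": 2, "content": "second paragraph"},
]
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.jsonl.gz")
    with gzip.open(demo_path, mode="wt", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")
    loaded = list(iter_corpus(demo_path))

print(len(loaded))
```

On the real file, the same generator could be consumed lazily, for example to filter paragraphs by title, without materializing the whole corpus in a list.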


Training example:

{
    "qid": "1089",
    "term": "Black Sea",
    "description": "Marginal sea of the Atlantic Ocean between Europe and Asia",
    "question": "Can sunlight travel to the deepest part of the Black Sea?",
    "answer": false,
    "facts": [
        "The Black Sea has a maximum depth of 2,212 meters",
        "Sunlight does not penetrate water below 1000 meters"
    ],
    "decomposition": [
        "What is the maximum depth of the Black Sea?",
        "How deep can sunlight penetrate a sea?",
        "Is #1 less than #2?"
    ],
    "evidence": [
        [
            [
                [
                    "Black Sea-2"
                ]
            ],
            [
                [
                    "Deep sea-1"
                ]
            ],
            [
                "operation"
            ]
        ],
        [
            [
                [
                    "Black Sea-2"
                ]
            ],
            [
                [
                    "Photic zone-5"
                ]
            ],
            [
                "operation"
            ]
        ],
        [
            [
                [
                    "Black Sea-2"
                ]
            ],
            [
                [
                    "Photic zone-5"
                ]
            ],
            [
                "operation"
            ]
        ]
    ]
}
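
The evidence field is nested three levels deep (annotation, decomposition step, matched item), which can be awkward to work with directly. The sketch below collects, for each decomposition step, the union of paragraph IDs matched by all annotations, skipping the reserved tags. It assumes that, as in the example above, each matched item is either a list of paragraph IDs or a tag string. The function name paragraph_ids_per_step is hypothetical.

```python
RESERVED_TAGS = {"no_evidence", "operation"}


def paragraph_ids_per_step(evidence):
    """Union the paragraph IDs matched to each decomposition step across annotations."""
    per_step = []
    for annotation in evidence:
        for step_index, step in enumerate(annotation):
            # Grow the result if this annotation has more steps than seen so far.
            while len(per_step) <= step_index:
                per_step.append(set())
            for item in step:
                if isinstance(item, str):
                    # Assumption: bare strings are tags; keep anything else as an ID.
                    if item not in RESERVED_TAGS:
                        per_step[step_index].add(item)
                else:
                    per_step[step_index].update(item)
    return [sorted(ids) for ids in per_step]


# The evidence field of the training example above.
evidence = [
    [[["Black Sea-2"]], [["Deep sea-1"]], ["operation"]],
    [[["Black Sea-2"]], [["Photic zone-5"]], ["operation"]],
    [[["Black Sea-2"]], [["Photic zone-5"]], ["operation"]],
]

ids_by_step = paragraph_ids_per_step(evidence)
print(ids_by_step)
```

For this example the last step yields an empty list, because all three annotations mark it as an operation rather than matching a paragraph.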

Example evidence paragraph:

"Winter clothing-1": {
    "title": "Winter clothing",
    "section": "",
    "headers": [
        ""
    ],
    "para_index": 1,
    "content": "Winter clothing are clothes used for protection against the particularly cold weather of winter. Often they have a good water resistance, consist of multiple layers to protect and insulate against low temperatures."
}

Paper

Title: Did Aristotle Use a Laptop? A Question Answering Benchmark with Implicit Reasoning Strategies

Authors: Mor Geva, Daniel Khashabi, Elad Segal, Tushar Khot, Dan Roth, Jonathan Berant

Transactions of the Association for Computational Linguistics (TACL), 2021

Citation:

@article{geva2021strategyqa,
  title = {{Did Aristotle Use a Laptop? A Question Answering Benchmark with Implicit Reasoning Strategies}},
  author = {Geva, Mor and Khashabi, Daniel and Segal, Elad and Khot, Tushar and Roth, Dan and Berant, Jonathan},
  journal = {Transactions of the Association for Computational Linguistics (TACL)},
  year = {2021},
}

Contact: morgeva [at] mail.tau.ac.il

Leaderboard

Top Public Submissions
Columns: Details, Created, SARI

1. RoBERTa*-IR-Q (large)
   Mor Geva, Daniel Khashabi, Elad Segal, Tushar Khot, Dan Roth and Jonathan Berant, from Tel Aviv University, Allen Institute for AI and University of Pennsylvania.
   Created: 1/30/2021
   SARI: 0%

2. RoBERTa*-IR-D (large)
   Mor Geva, Daniel Khashabi, Elad Segal, Tushar Khot, Dan Roth and Jonathan Berant, from Tel Aviv University, Allen Institute for AI and University of Pennsylvania.
   Created: 1/30/2021
   SARI: 54%

3. DPR for retrieval
   Inbal Leibovitch and Roi Cohen from Tel Aviv University
   Created: 4/9/2021
   SARI: 0%

3. DPR for retrieval
   Roi Cohen and Inbal Leibovitch from Tel Aviv University
   Created: 4/10/2021
   SARI: 0%