Introducing Ai2 ScholarQA
January 21, 2025
Literature review takes up a lot of researchers' time. While emerging AI tools can help get answers from a single paper, we found that researchers often need to compare and summarize multiple papers and understand the complex relationships between them. Ai2 ScholarQA is an experimental solution for this need: it lets you ask scientific questions that require multiple documents to answer. With table comparisons, expandable sections for subtopics, and citations with paper excerpts for verification, ScholarQA helps researchers get more in-depth, detailed, and contextual answers.
Ai2 ScholarQA follows a RAG-based, multi-step prompting workflow using a state-of-the-art closed model (Claude 3.5 Sonnet). It relies on a corpus of open-access papers (e.g., papers from arXiv).
Try Ai2 ScholarQA now: scholarqa.allen.ai
Corpus and the Search Index
For retrieval, our index is a Vespa cluster that currently holds about 8M academic papers across fields of study such as computer science, medicine, environmental science, and biology. The index is updated weekly and uses Open S2ORC as the inclusion criterion for open-access papers. When we search this index, we use a combination of BM25 and dense embeddings to score snippets extracted from full-text papers.
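The production index is a Vespa cluster, but the hybrid scoring idea is easy to illustrate locally. Below is a minimal sketch using the rank_bm25 and sentence-transformers packages; the fusion weight, embedding model, and snippets are illustrative, not our production configuration.

```python
# Minimal hybrid-retrieval sketch: blend BM25 with dense-embedding similarity.
# Hypothetical stand-in for the production Vespa ranking; weights are illustrative.
import numpy as np
from rank_bm25 import BM25Okapi                         # pip install rank-bm25
from sentence_transformers import SentenceTransformer   # pip install sentence-transformers

snippets = [
    "We pretrain a 7B language model on curated web data.",
    "Our influence-function method selects pretraining data.",
    "A survey of data selection heuristics for LLM pretraining.",
]

# Sparse scorer: BM25 over whitespace-tokenized snippets.
bm25 = BM25Okapi([s.lower().split() for s in snippets])

# Dense scorer: cosine similarity of normalized sentence embeddings.
encoder = SentenceTransformer("all-MiniLM-L6-v2")
snippet_vecs = encoder.encode(snippets, normalize_embeddings=True)

def hybrid_scores(query: str, alpha: float = 0.5) -> np.ndarray:
    """Blend min-max-normalized BM25 scores with embedding cosine similarity."""
    sparse = bm25.get_scores(query.lower().split())
    sparse = (sparse - sparse.min()) / (sparse.max() - sparse.min() + 1e-9)
    dense = snippet_vecs @ encoder.encode(query, normalize_embeddings=True)
    return alpha * sparse + (1 - alpha) * dense

query = "data selection methods for pretraining"
scores = hybrid_scores(query)
for i in np.argsort(-scores):  # print snippets from best to worst
    print(f"{scores[i]:.3f}  {snippets[i]}")
```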
Given this corpus, ScholarQA will be most useful to researchers in fields where most papers are available on arXiv.
If you have a Semantic Scholar API key, the full-text search feature is available to you now. Researchers with an academic-affiliated email address in certain countries are eligible to request a key and access the feature.
Section Planning and Generation
Ai2 ScholarQA is designed for literature-search questions that require insights from multiple relevant documents, synthesizing those insights into a comprehensive report. After receiving a query, the system first retrieves the top-k passages from the index. These passages are then re-ranked with a pretrained transformer model, and the top 50 candidates are retained for further processing. Answer generation is a 3-step process driven by prompts to an LLM (a minimal sketch of the full chain follows the list below):
- Quote extraction: The top re-ranked passages are fed to the LLM to select the most relevant quotes for answering the user query. This improves the precision of the candidate passages and reduces context overload in subsequent steps.
- Answer outline and clustering: The quotes are then provided to the LLM to generate a plan, which includes section headers for the report along with the relevant quotes to be included in each section. The format of each section can be either a paragraph or a bulleted list. Paragraph sections can convey nuanced relations between different papers, while bulleted list sections enumerate closely related papers, such as models, datasets, or interactive systems for the same tasks.
- Report generation: The section headers and quotes are finally used to generate the report. The report is generated one section at a time, conditioned on the text of previously generated sections. Each section is accompanied by a TLDR summary at the top, along with attributions to the supporting quotes and their source papers for further analysis.
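The sketch below illustrates this three-step prompt chain over the re-ranked passages. It is a minimal illustration only: `call_llm` is a stub standing in for calls to Claude 3.5 Sonnet, and the prompts, parsing, and helper names are hypothetical rather than our actual implementation.

```python
# Hypothetical sketch of the 3-step chain: quote extraction -> outline ->
# section-by-section generation. Input is assumed to be the re-ranked top-50 passages.

def call_llm(prompt: str) -> str:
    """Stand-in for an LLM API call (e.g., Anthropic's messages API)."""
    return f"[LLM output for prompt of {len(prompt)} chars]"

def extract_quotes(query: str, passages: list[str]) -> str:
    """Step 1: keep only the spans that directly bear on the query."""
    numbered = "\n".join(f"[{i}] {p}" for i, p in enumerate(passages))
    return call_llm(
        f"Question: {query}\nPassages:\n{numbered}\n"
        "Return the exact quotes most relevant to the question, with their [ids]."
    )

def plan_sections(query: str, quotes: str) -> list[dict]:
    """Step 2: produce section headers, formats, and quote assignments."""
    plan = call_llm(
        f"Question: {query}\nQuotes:\n{quotes}\n"
        "Propose report sections. For each, give a header, a format "
        "('paragraph' for nuanced synthesis, 'list' for enumerating closely "
        "related papers), and the quote ids it should cite."
    )
    # A real implementation would parse structured output; we fake one section here.
    return [{"header": plan, "format": "paragraph", "quote_ids": [0]}]

def generate_report(query: str, sections: list[dict], quotes: str) -> str:
    """Step 3: write sections in order, conditioning on earlier sections."""
    written: list[str] = []
    for sec in sections:
        text = call_llm(
            f"Question: {query}\nQuotes:\n{quotes}\n"
            f"Report so far:\n{''.join(written)}\n"
            f"Write the section '{sec['header']}' as a {sec['format']}, starting "
            "with a one-line TLDR and citing quote ids inline."
        )
        written.append(text + "\n")
    return "".join(written)

passages = ["Model A improves retrieval...", "Dataset B covers biomedical text..."]
query = "How do RAG systems cite evidence?"
quotes = extract_quotes(query, passages)
print(generate_report(query, plan_sections(query, quotes), quotes))
```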
Paper Comparison Table Generation
When conducting literature reviews, scientists often create literature review tables: tables in which the rows are papers to compare, and the columns are common aspects (like methods, datasets, or findings) that help compare and contrast those papers. These tables are especially useful for sets of closely related papers that share many specific aspects. Therefore, for each bulleted-list section, we also generate a literature comparison table for the cited papers.
We leverage LMs to perform this task by decomposing it into schema generation and value generation steps. Compared to jointly generating schemas and values, we found this two-step framework both improves the specificity and quality of the generated schemas and reduces model hallucination.
For schema generation, we represent input papers by their titles and abstracts, and we include the initial user query and the generated section text in the context to capture user intent. Each generated column consists of a short display name (e.g., Data Selection Method) and a definition (e.g., The technique used to select relevant data for pretraining. This could include heuristics, influence models, resampling methods…[snipped]).
For value generation, we prompt the LLM with each paper's full text and the column definition to generate, for each cell, a display value and relevant supporting snippets from the paper. We then map the generated snippets back to the actual sentences in the paper.
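In code, the decomposition looks roughly like the sketch below. Again, `call_llm`, the prompts, and the parsing are hypothetical stand-ins, not our actual implementation.

```python
# Hypothetical sketch of two-step table generation: schema first, then values.
from dataclasses import dataclass

def call_llm(prompt: str) -> str:
    """Stand-in for an LLM API call."""
    return f"[LLM output for prompt of {len(prompt)} chars]"

@dataclass
class Column:
    name: str        # short display name, e.g. "Data Selection Method"
    definition: str  # what the column means; reused during value generation

def generate_schema(query: str, section_text: str, papers: list[dict]) -> list[Column]:
    """Step 1: propose columns from titles/abstracts plus the user's intent."""
    context = "\n".join(f"{p['title']}: {p['abstract']}" for p in papers)
    raw = call_llm(
        f"User query: {query}\nSection: {section_text}\nPapers:\n{context}\n"
        "Propose comparison columns as 'name: definition' lines."
    )
    # A real implementation would parse the structured output; we fake one column.
    return [Column(name="Data Selection Method", definition=raw)]

def generate_value(column: Column, full_text: str) -> dict:
    """Step 2: fill one cell from a paper's full text, with supporting snippets."""
    out = call_llm(
        f"Column '{column.name}': {column.definition}\nPaper:\n{full_text}\n"
        "Return a short cell value and the snippets that support it."
    )
    return {"value": out, "snippets": [out]}  # snippets later mapped to paper sentences

papers = [{"title": "Paper A", "abstract": "Selects data via influence functions.",
           "full_text": "(full text of Paper A)"}]
schema = generate_schema("data selection for pretraining", "Methods overview", papers)
table = {p["title"]: {c.name: generate_value(c, p["full_text"]) for c in schema}
         for p in papers}
print(table)
```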
In our accompanying paper, we collected a large set of gold tables, which we call ArxivDIGESTables. Using these gold tables, we proposed a way to benchmark the literature table generation task. We also conducted additional experiments, such as adding context extracted from the original papers to measure how it influences the quality of the generated schemas (described in detail in the paper).
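To give a flavor of how such benchmarking can work, the toy sketch below softly aligns generated columns to gold columns and reports recall. This is a simplified stand-in for illustration only, not the evaluation protocol from the paper.

```python
# Toy stand-in for benchmarking schema generation against gold tables.
# Illustrates the column-alignment idea; not the paper's actual metrics.

def jaccard(a: str, b: str) -> float:
    """Token-set overlap between two column names."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def schema_recall(generated: list[str], gold: list[str], threshold: float = 0.5) -> float:
    """Fraction of gold columns matched by at least one generated column."""
    matched = sum(
        1 for g in gold if any(jaccard(g, c) >= threshold for c in generated)
    )
    return matched / len(gold)

gold = ["data selection method", "pretraining corpus", "model size"]
generated = ["method for data selection", "corpus used for pretraining"]
print(f"recall = {schema_recall(generated, gold):.2f}")  # 0.67 under this toy matcher
```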
Learnings and Next Steps
Ai2 ScholarQA is an experimental solution to help researchers conduct literature reviews more efficiently by providing more in-depth answers. We built an evidence-first pipeline, where the model writes an answer built around evidence, rather than writing an answer and then trying to find supporting evidence. We found that with this approach, the model can lose some coherence: for example, it may drift slightly off-topic while integrating the evidence it finds, trying to fit every quote into the answer even when some are only peripherally related. Additionally, because the system screens an enormous corpus for evidence and composes a more organized, in-depth answer, response times are longer than for other models.
We will be open-sourcing the core functionality in the coming weeks. In the future, we will be exploring more ways to assist scientific research with AI, such as more personalization. By providing these resources and knowledge to the community, we hope to unlock more potential for AI to help accelerate science.
Ai2 ScholarQA is a joint project between Ai2 and students from the University of Washington and the Korea Advanced Institute of Science & Technology (KAIST).