Ai2 OpenScholar: Scientific literature synthesis with retrieval-augmented language models
Akari Asai / November 19, 2024
Demo: https://openscholar.allen.ai/
Paper: https://openscholar.allen.ai/paper
OpenScholar code: https://github.com/AkariAsai/OpenScholar
ScholarQABench code: https://github.com/AkariAsai/ScholarQABench
Expert evaluation code: https://github.com/AkariAsai/OpenScholar_ExpertEval
Model checkpoints, index, data: https://huggingface.co/collections/OpenScholar/openscholar-v1-67376a89f6a80f448da411a6
On the shoulders of giants
Scientific progress hinges on our ability to find, synthesize, and build on relevant knowledge from the scientific literature. However, the exponential growth of this literature—with millions of papers now published each year—has made it increasingly difficult for scientists to find the information they need or even stay abreast of the latest findings in a single subfield.
To help scientists effectively navigate and synthesize scientific literature, we introduce Ai2 OpenScholar—a collaborative effort between the University of Washington and the Allen Institute for AI. OpenScholar is a retrieval-augmented language model (LM) designed to answer user queries by first searching for relevant papers in the literature and then generating responses grounded in those sources. Below are some examples:
On ScholarQABench, our new benchmark of open-ended scientific questions, OpenScholar-8B sets the state of the art on factuality and citation accuracy. For instance, on biomedical research questions, GPT-4o hallucinated more than 90% of the scientific papers that it cited, whereas OpenScholar-8B—by construction—remains grounded in real retrieved papers. To evaluate the effectiveness of OpenScholar in a real-world setup, we recruited 20 scientists working in computer science, biomedicine, and physics, and asked them to evaluate OpenScholar responses against expert-written answers. Across these three scientific disciplines, OpenScholar-8B’s responses were considered more useful than expert-written answers for the majority of questions.
OpenScholar is a research prototype, and it is just our first step toward building AI systems that can effectively assist scientists and accelerate scientific discovery. To support research in this direction, we have open-sourced all of our code, LM, retriever and re-ranker checkpoints, retrieval index, and data, including the training data for our language model and retriever, our OpenScholar datastore of academic papers, and the evaluation data in ScholarQABench. To our knowledge, this is the first open release of a complete pipeline for a scientific assistant LM—from data to training recipes to model checkpoints—and we’re excited to see how the community builds upon it.
Check out the Ai2 OpenScholar Demo at openscholar.allen.ai!
While the original OpenScholar model/code/data/results cover multiple scientific domains, this demo is currently limited to questions and papers about computer science; we hope to expand support to other scientific fields soon. We are also preparing to publicly release the retrieval service that backs our demo as a separate public API, which will provide full-text search over open-access papers available through Ai2’s Semantic Scholar API.
How OpenScholar works
Our OpenScholar-8B (OS-8B) system comprises the following components:
- OpenScholar Datastore: A collection of more than 45M papers from Semantic Scholar and ~250M corresponding passage embeddings. The underlying data comes from an updated version of peS2o (Soldaini et al., 2024) that consists of papers up to October 2024.
- Specialized Retrievers and Rerankers: These tools are trained specifically to identify relevant passages from our scientific literature datastore.
- Specialized 8B Language Model: An 8B-parameter LM optimized for scientific literature synthesis tasks, balancing performance with computational efficiency. To train this, we fine-tune Llama 3.1 8B on synthetic data generated from our iterative self-feedback generation pipeline, described below.
- Iterative Self-Feedback Generation: At inference time, we use iterative self-feedback to refine model outputs through natural language feedback. Each iteration also retrieves additional papers, allowing us to improve answer quality and fill citation gaps.
Our datastore, retriever and reranking models, and self-feedback generation pipeline can also be applied on top of other off-the-shelf LMs. Below, we discuss results on both OS-8B, which uses our specialized 8B model, and on OS-GPT4o, which uses GPT-4o as the base LM.
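To make this concrete, below is a minimal sketch of the retrieve, rerank, generate, and self-feedback loop in the spirit of the components above. The helper functions, prompts, and defaults (dense_retrieve, rerank, generate, max_iters) are hypothetical placeholders, not the actual OpenScholar implementation or API.

```python
# A minimal sketch of the retrieve -> rerank -> generate -> self-feedback loop
# described above. Helper functions and prompts are hypothetical placeholders,
# not the actual OpenScholar code.
from dataclasses import dataclass

@dataclass
class Passage:
    paper_id: str
    text: str

def dense_retrieve(query: str, k: int = 100) -> list[Passage]:
    """Placeholder: embed the query and return the top-k nearest passages
    from the datastore's passage-embedding index."""
    raise NotImplementedError

def rerank(query: str, passages: list[Passage], k: int = 10) -> list[Passage]:
    """Placeholder: rescore (query, passage) pairs with the reranker and keep the top-k."""
    raise NotImplementedError

def generate(prompt: str) -> str:
    """Placeholder: call the generator LM (the fine-tuned 8B model for OS-8B,
    or an off-the-shelf LM such as GPT-4o for OS-GPT4o)."""
    raise NotImplementedError

def format_context(passages: list[Passage]) -> str:
    return "\n\n".join(f"[{p.paper_id}] {p.text}" for p in passages)

def answer_with_self_feedback(query: str, max_iters: int = 3) -> str:
    passages = rerank(query, dense_retrieve(query))
    draft = generate(
        f"Question: {query}\n\nPapers:\n{format_context(passages)}\n\n"
        "Write an answer with citations to the papers above."
    )
    for _ in range(max_iters):
        # Ask the LM to critique its own draft in natural language.
        feedback = generate(
            f"Question: {query}\n\nDraft answer:\n{draft}\n\n"
            "List missing evidence, unsupported claims, or organization issues, "
            "or say 'no issues'."
        )
        if "no issues" in feedback.lower():
            break
        # Use the feedback as a new retrieval query to pull in additional papers,
        # then revise the draft against the expanded context.
        passages += rerank(feedback, dense_retrieve(feedback))
        draft = generate(
            f"Question: {query}\n\nPapers:\n{format_context(passages)}\n\n"
            f"Previous draft:\n{draft}\n\nFeedback:\n{feedback}\n\n"
            "Write a revised answer with citations to the papers above."
        )
    return draft
```

Because the loop only interacts with a retriever, a reranker, and a generate function, the same scaffolding can wrap either our fine-tuned 8B model or an off-the-shelf LM such as GPT-4o.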
Note: For scalability reasons, our demo currently only uses a datastore of computer science papers, and it also uses a more efficient sparse-dense index compared to the flat dense index we used for the paper. Changing the retrieval system does not require retraining OS-8B, demonstrating the flexibility of our method.
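As a rough illustration of that flexibility, the answer-generation step only depends on a narrow retrieval interface, so the index behind it can change without touching the LM. The class and method names below are hypothetical, not our actual code:

```python
# Illustrative only: the answer-generation step depends on a narrow retrieval
# interface, so a flat dense index can be swapped for a sparse-dense hybrid
# without retraining the generator LM. Class names are hypothetical.
from typing import Callable, Protocol

class Retriever(Protocol):
    def retrieve(self, query: str, k: int) -> list[str]:
        """Return the text of the top-k passages for the query."""
        ...

class FlatDenseRetriever:
    """Exhaustive nearest-neighbor search over dense passage embeddings
    (the kind of flat dense index used for the paper's experiments)."""
    def retrieve(self, query: str, k: int) -> list[str]:
        raise NotImplementedError

class SparseDenseRetriever:
    """A more efficient hybrid of lexical (sparse) and dense scoring
    (the kind of index the demo uses)."""
    def retrieve(self, query: str, k: int) -> list[str]:
        raise NotImplementedError

def answer(query: str, retriever: Retriever, generate: Callable[[str], str]) -> str:
    # The generator never sees which index produced the passages,
    # so swapping retrievers requires no retraining.
    context = "\n\n".join(retriever.retrieve(query, k=10))
    return generate(f"Question: {query}\n\nPapers:\n{context}\n\nAnswer with citations.")
```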
ScholarQABench: Realistic evaluations of open-ended scientific questions
To evaluate OpenScholar, we developed ScholarQABench—a specialized benchmark designed to assess LLMs on open-ended scientific questions that require synthesizing information from multiple papers.
ScholarQABench consists of seven datasets—three existing datasets that focus on single-paper evaluations, and four newly collected datasets of questions that require synthesis over multiple papers:
- ScholarQA-Bio: 1,451 biomedical research questions sourced from experts with PhDs in relevant areas.
- ScholarQA-Neuro: 1,308 neuroscience research questions sourced from experts with PhDs in relevant areas.
- ScholarQA-CS: 100 computer science research questions sourced from experts with PhDs in computer science, each paired with a set of rubric criteria that answers should meet.
- ScholarQA-Multi: 108 questions and answers with citations, sourced from experts in computer science, physics, and biomedicine. Each expert spent about an hour on average writing a comprehensive answer to each question.
Evaluating long-form answers in expert domains is challenging. For ScholarQABench, we developed automated metrics to evaluate Correctness, Citation F1 (are statements sufficiently and precisely supported by their citations?), Coverage, Relevance, and Organization. To assess answer quality, we compare model output against the expert-written rubrics in ScholarQA-CS and the expert-written responses in ScholarQA-Multi.
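For intuition, here is a minimal sketch of a citation-F1-style computation, assuming the answer has already been segmented into statements with attached citations and that supports() is some entailment judgment (e.g., an NLI model or an LM judge); the exact ScholarQABench implementation may differ.

```python
# Illustrative citation-F1 computation; not the exact ScholarQABench metric.
from typing import Callable

def citation_f1(
    statements: list[tuple[str, list[str]]],      # (statement, cited passage texts)
    supports: Callable[[list[str], str], bool],   # do these passages entail the statement?
) -> float:
    supported = 0          # recall numerator: statements backed by their citations
    precise_citations = 0  # precision numerator: citations that support their statement
    total_citations = 0

    for statement, citations in statements:
        if citations and supports(citations, statement):
            supported += 1
        for citation in citations:
            total_citations += 1
            if supports([citation], statement):
                precise_citations += 1

    recall = supported / len(statements) if statements else 0.0
    precision = precise_citations / total_citations if total_citations else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```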
While we believe ScholarQABench is the first realistic multidisciplinary literature synthesis benchmark, it still has several limitations, which we discuss in the paper—for example, it only covers four scientific fields, and has relatively few expert-written responses. We hope that future work will build on ScholarQABench to develop increasingly realistic and accurate scientific literature benchmarks to guide progress in this field.
Key results
In automated evaluations on ScholarQABench as well as in human evaluations, OpenScholar models performed better than leading proprietary and open-source models, including GPT-4 and Llama 3.1 70B.
Results on ScholarQABench
- Overall generation quality of OpenScholar-8B: OS-8B outperforms much larger proprietary and open models, such as GPT-4o, by 6.1%, and it outperforms systems specialized for scientific literature and/or retrieval, such as PaperQA2 (Skarlinski et al., 2024), which uses GPT-4o throughout its pipeline, by 5.5%.
- Citation generation: When answering open-ended research questions, we found that GPT-4o and other models generated inaccurate or nonexistent citations in 80–95% of cases, with near-zero citation F1. In contrast, OS-8B significantly improved citation F1.
- Applying the OpenScholar pipeline to GPT-4o: We applied our OpenScholar datastore, retriever and reranker, and self-feedback generation pipeline on top of GPT-4o. Compared to the base GPT-4o model on ScholarQA-CS, this improves correctness by 12% and dramatically improves citation F1 from 0.1 to 39.5.
- Cost efficiency: By employing smaller yet competitive retriever and generator LMs, OS-8B is 100x more cost-efficient than concurrent systems such as PaperQA2, which relies on GPT-4o for reranking, summarization, and answer generation.
Expert evaluations
We recruited 20 experts from diverse fields, including computer science, physics, and biomedicine, to evaluate OpenScholar's responses on the ScholarQA-Multi dataset. These experts compared OpenScholar's outputs against the corresponding human-written responses, each of which took a scientist (who either holds or is working toward a PhD) approximately one hour to compose. Together, the experts completed over 500 fine-grained evaluations and pairwise comparisons. We found that:
- Experts preferred OpenScholar responses: Participants preferred OpenScholar's outputs 70% of the time, noting that they were more comprehensive, well-organized, and useful for literature synthesis.
- Other models were limited: Models without retrieval capabilities, such as GPT-4o, struggled with information coverage and were judged to be less helpful than the expert-written answers.
- OpenScholar responses were more useful than human-written responses: OpenScholar consistently outperformed scientist-written answers in terms of information coverage and utility.
Limitations and future directions
OpenScholar is a research prototype. As the results above show, we’re confident that it gives more accurate and reliable answers to scientific research questions than other models. However, it still has several limitations, which we highlight in the hope that we can collectively address them in future work:
- OpenScholar may cite papers that are less representative. For instance, when describing a particular method, it may fail to cite the original paper that proposed the method, and instead cite another paper that mentions the method. Example: This response misses the original paper that first described edge evaluation as the main bottleneck in planning.
- OpenScholar may occasionally generate responses that are unsupported by citations, or retrieve papers that are not the most relevant or up-to-date in the field. Example: When asked about large foundation models in robotics, this response cites a paper with a 307M parameter model, whereas the current largest foundation model in robotics (as of November 2024), RT-2, has 55 billion parameters.
- OpenScholar may still generate citations directly from parametric knowledge instead of relying on the papers it has retrieved. These citations might be hallucinated and might not correspond to any real paper. In our demo, such citations do not appear as links. Example: This response cites Si et al. even though that paper was not retrieved.
- Many scientific papers are paywalled. To ensure that we respect all applicable licenses and copyrights, the OpenScholar datastore includes only open-access papers. This can significantly degrade our ability to answer questions in fields where closed-access papers are more prevalent. We hope that future work can address this issue by developing ways of responsibly incorporating such papers (e.g., by restricting verbatim copying from those papers, and instead linking out to their respective publisher sites).
Conclusion
We present OpenScholar, our open retrieval-augmented LM, which demonstrates superior generation quality and citation accuracy compared to existing proprietary systems. These results are supported by automatic evaluations on ScholarQABench and expert assessments across four scientific disciplines. We welcome your feedback and suggestions to help us further enhance OpenScholar. Try out our public demo at https://openscholar.allen.ai/ and share your thoughts!
Demo: https://openscholar.allen.ai/
Paper: https://openscholar.allen.ai/paper
OpenScholar code: https://github.com/AkariAsai/OpenScholar
ScholarQABench code: https://github.com/AkariAsai/ScholarQABench
Expert evaluation code: https://github.com/AkariAsai/OpenScholar_ExpertEval
Model checkpoints, index, data: https://huggingface.co/collections/OpenScholar/openscholar-v1-67376a89f6a80f448da411a6
Acknowledgments
This work is led by researchers from the University of Washington and Ai2, in collaboration with Meta, Carnegie Mellon University, the University of Illinois Urbana-Champaign, Stanford University, and the University of North Carolina, Chapel Hill. Akari Asai was supported by the Meta AI Mentorship program. We are grateful for support from the Singapore National Research Foundation and the National AI Group in the Singapore Ministry of Digital Development and Information under the AI Visiting Professorship Programme.