Ai2 OpenScholar: Scientific literature synthesis with retrieval-augmented language models
Akari Asai / November 19, 2024
Demo: https://openscholar.allen.ai/
Paper: https://openscholar.allen.ai/paper
OpenScholar code: https://github.com/AkariAsai/OpenScholar
ScholarQABench code: https://github.com/AkariAsai/ScholarQABench
Expert evaluation code: https://github.com/AkariAsai/OpenScholar_ExpertEval
Model checkpoints, index, data: https://huggingface.co/collections/OpenScholar/openscholar-v1-67376a89f6a80f448da411a6
On the shoulders of giants
Scientific progress hinges on our ability to find, synthesize, and build on relevant knowledge from the scientific literature. However, the exponential growth of this literature—with millions of papers now published each year—has made it increasingly difficult for scientists to find the information they need or even stay abreast of the latest findings in a single subfield.
To help scientists effectively navigate and synthesize scientific literature, we introduce Ai2 OpenScholar—a collaborative effort between the University of Washington and the Allen Institute for AI. OpenScholar is a retrieval-augmented language model (LM) designed to answer user queries by first searching for relevant papers in the literature and then generating responses grounded in those sources. Below are some examples:
On ScholarQABench, our new benchmark of open-ended scientific questions, OpenScholar-8B sets the state of the art on factuality and citation accuracy. For instance, on biomedical research questions, GPT-4o hallucinated more than 90% of the scientific papers that it cited, whereas OpenScholar-8B—by construction—remains grounded in real retrieved papers. To evaluate the effectiveness of OpenScholar in a real-world setup, we recruited 20 scientists working in computer science, biomedicine, and physics, and asked them to evaluate OpenScholar responses against expert-written answers. Across these three scientific disciplines, OpenScholar-8B’s responses were considered more useful than expert-written answers for the majority of questions.
OpenScholar is a research prototype, and it is just our first step toward building AI systems that can effectively assist scientists and accelerate scientific discovery. To support research in this direction, we have open-sourced all of our code, LM, retriever and re-ranker checkpoints, retrieval index, and data, including the training data for our language model and retriever, our OpenScholar datastore of academic papers, and the evaluation data in ScholarQABench. To our knowledge, this is the first open release of a complete pipeline for a scientific assistant LM—from data to training recipes to model checkpoints—and we’re excited to see how the community builds upon it.
Check out the Ai2 OpenScholar Demo at openscholar.allen.ai!
While the original OpenScholar model/code/data/results cover multiple scientific domains, this demo is currently limited to questions and papers about computer science; we hope to expand support to other scientific fields soon. We are also preparing to publicly release the retrieval service that backs our demo as a separate public API, which will provide full-text search over open-access papers available through Ai2’s Semantic Scholar API.
How OpenScholar works
Our OpenScholar-8B (OS-8B) system comprises the following components:
- OpenScholar Datastore: A collection of more than 45M papers from Semantic Scholar and ~250M corresponding passage embeddings. The underlying data comes from an updated version of peS2o (Soldaini et al., 2024) that consists of papers up to October 2024.
- Specialized Retrievers and Rerankers: These tools are trained specifically to identify relevant passages from our scientific literature datastore.
- Specialized 8B Language Model: An 8B-parameter LM optimized for scientific literature synthesis tasks, balancing performance with computational efficiency. To train this, we fine-tune Llama 3.1 8B on synthetic data generated from our iterative self-feedback generation pipeline, described below.
- Iterative Self-Feedback Generation: At inference time, we use iterative self-feedback to refine model outputs through natural language feedback. Each iteration also retrieves additional papers, allowing us to improve answer quality and fill citation gaps.
Our datastore, retriever and reranking models, and self-feedback generation pipeline can also be applied on top of other off-the-shelf LMs. Below, we discuss results on both OS-8B, which uses our specialized 8B model, and on OS-GPT4o, which uses GPT-4o as the base LM.
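To make this concrete, below is a minimal sketch of the retrieve, rerank, generate, and self-feedback loop in the spirit of the components above. The helper functions, prompts, and defaults (dense_retrieve, rerank, generate, max_iters) are hypothetical placeholders, not the actual OpenScholar implementation or API.

```python
# A minimal sketch of the retrieve -> rerank -> generate -> self-feedback loop
# described above. Helper functions and prompts are hypothetical placeholders,
# not the actual OpenScholar code.
from dataclasses import dataclass

@dataclass
class Passage:
    paper_id: str
    text: str

def dense_retrieve(query: str, k: int = 100) -> list[Passage]:
    """Placeholder: embed the query and return the top-k nearest passages
    from the datastore's passage-embedding index."""
    raise NotImplementedError

def rerank(query: str, passages: list[Passage], k: int = 10) -> list[Passage]:
    """Placeholder: rescore (query, passage) pairs with the reranker and keep the top-k."""
    raise NotImplementedError

def generate(prompt: str) -> str:
    """Placeholder: call the generator LM (the fine-tuned 8B model for OS-8B,
    or an off-the-shelf LM such as GPT-4o for OS-GPT4o)."""
    raise NotImplementedError

def format_context(passages: list[Passage]) -> str:
    return "\n\n".join(f"[{p.paper_id}] {p.text}" for p in passages)

def answer_with_self_feedback(query: str, max_iters: int = 3) -> str:
    passages = rerank(query, dense_retrieve(query))
    draft = generate(
        f"Question: {query}\n\nPapers:\n{format_context(passages)}\n\n"
        "Write an answer with citations to the papers above."
    )
    for _ in range(max_iters):
        # Ask the LM to critique its own draft in natural language.
        feedback = generate(
            f"Question: {query}\n\nDraft answer:\n{draft}\n\n"
            "List missing evidence, unsupported claims, or organization issues, "
            "or say 'no issues'."
        )
        if "no issues" in feedback.lower():
            break
        # Use the feedback as a new retrieval query to pull in additional papers,
        # then revise the draft against the expanded context.
        passages += rerank(feedback, dense_retrieve(feedback))
        draft = generate(
            f"Question: {query}\n\nPapers:\n{format_context(passages)}\n\n"
            f"Previous draft:\n{draft}\n\nFeedback:\n{feedback}\n\n"
            "Write a revised answer with citations to the papers above."
        )
    return draft
```

Because the loop only interacts with a retriever, a reranker, and a generate function, the same scaffolding can wrap either our fine-tuned 8B model or an off-the-shelf LM such as GPT-4o.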
Note: For scalability reasons, our demo currently only uses a datastore of computer science papers, and it also uses a more efficient sparse-dense index compared to the flat dense index we used for the paper. Changing the retrieval system does not require retraining OS-8B, demonstrating the flexibility of our method.
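As a rough illustration of that flexibility, the answer-generation step only depends on a narrow retrieval interface, so the index behind it can change without touching the LM. The class and method names below are hypothetical, not our actual code:

```python
# Illustrative only: the answer-generation step depends on a narrow retrieval
# interface, so a flat dense index can be swapped for a sparse-dense hybrid
# without retraining the generator LM. Class names are hypothetical.
from typing import Callable, Protocol

class Retriever(Protocol):
    def retrieve(self, query: str, k: int) -> list[str]:
        """Return the text of the top-k passages for the query."""
        ...

class FlatDenseRetriever:
    """Exhaustive nearest-neighbor search over dense passage embeddings
    (the kind of flat dense index used for the paper's experiments)."""
    def retrieve(self, query: str, k: int) -> list[str]:
        raise NotImplementedError

class SparseDenseRetriever:
    """A more efficient hybrid of lexical (sparse) and dense scoring
    (the kind of index the demo uses)."""
    def retrieve(self, query: str, k: int) -> list[str]:
        raise NotImplementedError

def answer(query: str, retriever: Retriever, generate: Callable[[str], str]) -> str:
    # The generator never sees which index produced the passages,
    # so swapping retrievers requires no retraining.
    context = "\n\n".join(retriever.retrieve(query, k=10))
    return generate(f"Question: {query}\n\nPapers:\n{context}\n\nAnswer with citations.")
```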
ScholarQABench: Realistic evaluations of open-ended scientific questions
To evaluate OpenScholar, we developed ScholarQABench—a specialized benchmark designed to assess LLMs on open-ended scientific questions that require synthesizing information from multiple papers.
ScholarQABench consists of seven datasets—three existing datasets that focus on single-paper evaluations, and four newly collected datasets of questions that require synthesis over multiple papers:
- ScholarQA-Bio: 1,451 biomedical research questions sourced from experts with PhDs in relevant areas.
- ScholarQA-Neuro: 1,308 neuroscience research questions sourced from experts with PhDs in relevant areas.
- ScholarQA-CS: 100 computer science research questions sourced from experts with PhDs in computer science, each paired with a set of rubric criteria that answers should meet.
- ScholarQA-Multi: 108 questions and answers with citations, sourced from experts in computer science, physics, and biomedicine. Each expert spent about an hour on average writing a comprehensive answer to each question.
Evaluating long-form answers in expert domains is challenging. For ScholarQABench, we developed automated metrics to evaluate Correctness, Citation F1 (are statements sufficiently and precisely supported by their citations?), Coverage, Relevance, and Organization. To assess answer quality, we compare model output against the expert-written rubrics in ScholarQA-CS and the expert-written responses in ScholarQA-Multi.
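For intuition, here is a minimal sketch of a citation-F1-style computation, assuming the answer has already been segmented into statements with attached citations and that supports() is some entailment judgment (e.g., an NLI model or an LM judge); the exact ScholarQABench implementation may differ.

```python
# Illustrative citation-F1 computation; not the exact ScholarQABench metric.
from typing import Callable

def citation_f1(
    statements: list[tuple[str, list[str]]],      # (statement, cited passage texts)
    supports: Callable[[list[str], str], bool],   # do these passages entail the statement?
) -> float:
    supported = 0          # recall numerator: statements backed by their citations
    precise_citations = 0  # precision numerator: citations that support their statement
    total_citations = 0

    for statement, citations in statements:
        if citations and supports(citations, statement):
            supported += 1
        for citation in citations:
            total_citations += 1
            if supports([citation], statement):
                precise_citations += 1

    recall = supported / len(statements) if statements else 0.0
    precision = precise_citations / total_citations if total_citations else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```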
While we believe ScholarQABench is the first realistic multidisciplinary literature synthesis benchmark, it still has several limitations, which we discuss in the paper—for example, it only covers four scientific fields, and has relatively few expert-written responses. We hope that future work will build on ScholarQABench to develop increasingly realistic and accurate scientific literature benchmarks to guide progress in this field.
Key results
In automated evaluations on ScholarQABench as well as in human evaluations, OpenScholar models performed better than leading proprietary and open-source models, including GPT-4 and Llama 3.1 70B.
Results on ScholarQABench
- Overall generation quality of OpenScholar-8B: OS-8B outperforms much larger proprietary and open models, such as GPT-4o, by 6.1%, and it outperforms systems specialized for scientific literature and/or retrieval, such as PaperQA2 (Skarlinski et al., 2024), which uses GPT-4o throughout its pipeline, by 5.5%.
- Citation generation: When answering open-ended research questions, we found that GPT-4o and other models generated inaccurate or nonexistent citations in 80–95% of cases, with near-zero citation F1. In contrast, OS-8B significantly improved citation F1.
- Applying the OpenScholar pipeline to GPT-4o: We applied our OpenScholar datastore, retriever and reranker, and self-feedback generation pipeline on top of GPT-4o. Compared to the base GPT-4o model on ScholarQA-CS, this improves correctness by 12% and dramatically improves citation F1 from 0.1 to 39.5.
- Cost efficiency: By employing smaller yet competitive retriever and generator LMs, OS-8B is 100x more cost-efficient than concurrent systems such as PaperQA2, which relies on GPT-4o for reranking, summarization, and answer generation.
Expert evaluations
We recruited 20 experts from diverse fields, including computer science, physics, and biomedicine, to evaluate OpenScholar's responses on the ScholarQA-Multi dataset. These experts compared OpenScholar's outputs against the corresponding human-written responses, each of which took a scientist (who either holds or is working toward a PhD) approximately one hour to compose. Together, the experts completed over 500 fine-grained evaluations and pairwise comparisons. We found that:
- Experts preferred OpenScholar responses: Participants preferred OpenScholar's outputs 70% of the time, noting that they were more comprehensive, well-organized, and useful for literature synthesis.
- Other models were limited: Models without retrieval capabilities, such as GPT-4o, struggled with information coverage and were judged to be less helpful than the expert-written answers.
- OpenScholar responses were more useful than human-written responses: OpenScholar consistently outperformed scientist-written answers in terms of information coverage and utility.
Limitations and future directions
OpenScholar is a research prototype. As the results above show, we’re confident that it gives more accurate and reliable answers to scientific research questions than other models. However, it still has several limitations, which we highlight in the hope that we can collectively address them in future work:
- OpenScholar may cite papers that are less representative. For instance, when describing a particular method, it may fail to cite the original paper that proposed the method, and instead cite another paper that mentions the method. Example: This response misses the original paper that first described edge evaluation as the main bottleneck in planning.
- OpenScholar may occasionally generate responses that are unsupported by citations, or retrieve papers that are not the most relevant or up-to-date in the field. Example: When asked about large foundation models in robotics, this response cites a paper with a 307M parameter model, whereas the current largest foundation model in robotics (as of November 2024), RT-2, has 55 billion parameters.
- OpenScholar may still generate citations directly from parametric knowledge instead of relying on the papers it has retrieved. These citations might be hallucinated and might not correspond to any real paper. In our demo, such citations do not appear as links. Example: This response cites Si et al. even though that paper was not retrieved.
- Many scientific papers are paywalled. To ensure that we respect all applicable licenses and copyrights, the OpenScholar datastore includes only open-access papers. This can significantly degrade our ability to answer questions in fields where closed-access papers are more prevalent. We hope that future work can address this issue by developing ways of responsibly incorporating such papers (e.g., by restricting verbatim copying from those papers, and instead linking out to their respective publisher sites).
Conclusion
We present OpenScholar, our open retrieval-augmented LM, which demonstrates superior generation quality and citation accuracy compared to existing proprietary systems. These results are supported by automatic evaluations on ScholarQABench and expert assessments across four scientific disciplines. We welcome your feedback and suggestions to help us further enhance OpenScholar. Try out our public demo at https://openscholar.allen.ai/ and share your thoughts!
Demo: https://openscholar.allen.ai/
Paper: https://openscholar.allen.ai/paper
OpenScholar code: https://github.com/AkariAsai/OpenScholar
ScholarQABench code: https://github.com/AkariAsai/ScholarQABench
Expert evaluation code: https://github.com/AkariAsai/OpenScholar_ExpertEval
Model checkpoints, index, data: https://huggingface.co/collections/OpenScholar/openscholar-v1-67376a89f6a80f448da411a6
Acknowledgments
This work is led by researchers from the University of Washington and Ai2, in collaboration with Meta, Carnegie Mellon University, the University of Illinois Urbana-Champaign, Stanford University, and the University of North Carolina, Chapel Hill. Akari Asai was supported by the Meta AI Mentorship program. We are grateful for support from the Singapore National Research Foundation and the National AI Group in the Singapore Ministry of Digital Development and Information under the AI Visiting Professorship Programme.