Introducing Ai2 Paper Finder
March 26, 2025
Today we release Ai2 Paper Finder, an LLM-powered literature search system.
We believe that AI-based literature search should follow the research and thought process that a human researcher would use when looking for relevant papers in their field. Ai2 Paper Finder is built on this philosophy, and it excels at locating papers that are hard to find using existing search tools.
Consider your own research process and how you search for papers. For example, imagine you’re looking for papers that introduce a dataset of an unscripted dialogue between 2 speakers (written or transcription) in English where there is an annotation of some property (emotion, age, gender, etc.) of one of the speakers.
How would you approach it today?
You’ll likely start by choosing a tool (Semantic Scholar, Google Scholar, or a regular Google search), or by asking an LLM like GPT. You will then come up with search terms (based on your knowledge of both the domain and the tool you chose), and see what comes up. The results will likely not perfectly match what you need, but will be a good start. From the search results, your intuition will guide you to follow-ups: maybe you learn new vocabulary or are reminded of a related concept that leads you to a new search query, or you find a promising lead and start to follow citations or a specific author’s work. The point is that literature search is a multi-step process that involves learning and iterating as you go.
We built this thinking process right into Ai2 Paper Finder. When you enter a query, you can watch as the system breaks down your query into relevant components, searches for papers, follows citations, evaluates for relevance, runs follow-up queries based on the results, and then presents not only the papers, but also short summaries of why the paper is relevant to your specific query.
Ai2 Paper Finder doesn’t need you to simplify your queries into keywords to perform an effective search. Take the original example we started with; we can enter it into Paper Finder as written: “papers that introduce a dataset of an unscripted dialogue between 2 speakers (written or transcription) in English where there is an annotation of some property (emotion, age, gender, etc.) of one of the speakers.” See the results.
To what extent can we create an AI agent that effectively mimics the research process in this way? This question is of both practical and academic interest. On the practical side, this will save researchers countless hours, days, or even weeks, and allow us to find leads that we otherwise would have missed. On the academic side, this raises many interesting research questions around modeling long-term processes with LLM agents, systems that learn in a goal-directed way, human-computer interaction and interactive computation, and so many other complex challenges.
We are excited to share the first iteration of Ai2 Paper Finder with the community. This is a work in progress, and we look forward to having everyone interact with it, use it for day-to-day work, and provide feedback on the results. Please try it out, explore its boundaries, and let us know if it doesn’t work as expected! We understand that Ai2 Paper Finder is far from perfect and we will actively monitor your feedback and look for opportunities to improve.
Ai2 Paper Finder vs other tools
How does Ai2 Paper Finder differ from other literature search solutions? First and foremost, we aim to be as open as possible. We are very interested in the research questions that openness enables, and we are committed to fully and openly describing every aspect of the system. To that end, a detailed tech report is coming soon. We also aim to be open about the query stream we receive, which, for users who opt in, we plan to mine for interesting queries and release as community-wide benchmarks. While issues relating to academic copyright complicate the process of open-sourcing the code today, we hope our commitment to openness will empower the rest of the community to join us in tackling big research questions, and we plan to release more of our source code in the future.
Ai2 Paper Finder also differs in goals and scope. Many other tools, such as Perplexity, focus on returning a few popular results. Paper Finder also aims to cover the long tail of niche findings and hard-to-find papers, which requires an iterative process in which you learn new information as you go and let it guide your next moves. We believe this scope can better serve researchers who are experts in their fields.
Other efforts (including Ai2 ScholarQA) are creating research summaries. Summaries are based on retrieval, but summarizing is different from paper finding. The difference is both in the form in which the results are presented (a list versus a summary) and in how the information is intended to be consumed. Summaries are meant as overviews and are not meant to be exhaustive: if you get one prominent paper from an area, it is OK to neglect others. In paper finding, by contrast, we often want the results to be significantly more exhaustive. And while summaries are mostly intended for learning about a new topic, paper finding helps you dig deeper into areas you already know.
Finally, some tools, most notably Undermind, are working in the same domain with a similar goal of finding good and exhaustive sets of papers on a topic. We’re excited to have a wider community working on these challenges and moving the field forward together.
How does Ai2 Paper Finder work?
We now enter the technical part of this post, describing some of the nuts and bolts behind the Paper Finder system.
We focus on describing the single-turn functionality: what happens from the time you issue a query until Paper Finder returns a result set as a response. For now, we ignore additional user interaction mechanisms, like responding to follow-up messages from a user (e.g., "these results are great, but now let's focus on..."), asking the user clarifying questions about the query before performing the search, or refusing to perform requests outside its expertise. These are supported to some extent in our demo, but we consider them work in progress: we are still exploring possible solutions, and expect the end product to be substantially different from what we have now. We keep the description here at a relatively high level, leaving the nitty-gritty details for an upcoming tech report.
The bird’s eye view
What happens between the time the user issues a request and the returned result-set? Ai2 Paper Finder works in a semi-rigid flow: a predefined structure that is influenced at key points by various LLM decisions. (Of course, it would be cool if the LLM component were more dynamic and allowed more autonomy in how it influences the flow. We are looking into that, so stay tuned!)
First, the query arrives at the query analyzer, a component that looks at the query and breaks it down into intents and components. For example, is the query intended to find a set of papers ("open source LLM benchmarks that focus on fairness"), or a specific known paper ("the olmo paper")? Does the query mention metadata criteria (author, venue, years, ...), and if so, which? What are the semantic criteria in the query? Does the query mention qualities like "central", "popular", or "recent"? These details are then packaged into a data structure and passed to the query planner (or "router").
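To make this concrete, here is a minimal sketch of what such an analyzed-query record might look like. It is written in Python, and the field names and example values are our own illustrative assumptions, not the actual schema:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class AnalyzedQuery:
    """Hypothetical container for the query analyzer's output."""
    raw_query: str                                      # the user's query, verbatim
    intent: str                                         # e.g. "paper_set" or "specific_paper"
    semantic_criteria: Optional[str] = None             # the content-based part of the query
    authors: list[str] = field(default_factory=list)    # metadata: author names
    venues: list[str] = field(default_factory=list)     # metadata: publication venues
    year_range: Optional[tuple[int, int]] = None        # metadata: (from, to) years
    modifiers: list[str] = field(default_factory=list)  # e.g. "recent", "central", "classic"

# For example, the analyzer might turn a free-text query into something like:
example = AnalyzedQuery(
    raw_query="recent open source LLM benchmarks that focus on fairness",
    intent="paper_set",
    semantic_criteria="open source LLM benchmarks that focus on fairness",
    modifiers=["recent"],
)
```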
The query planner takes the analyzed query, and comes up with an execution strategy. This strategy depends on the search goals, and also on what's available in the different APIs and what will be efficient to execute. Our current implementation routes to one of several pre-defined sub-flows: specific paper search, semantic search with potential metadata constraints, pure-metadata queries, and queries that involve an author name.
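Continuing the sketch above, the routing decision itself can be pictured as a small function that maps an analyzed query to a sub-flow name. The conditions and names here are illustrative assumptions, not the actual planner logic:

```python
def plan(query: AnalyzedQuery) -> str:
    """Pick a sub-flow for the analyzed query (illustrative routing only)."""
    if query.intent == "specific_paper":
        return "specific_paper_search"
    if query.semantic_criteria is None:
        # No content criteria at all, e.g. "ACL 2023 papers by Jane Doe" (hypothetical).
        return "pure_metadata_search"
    if query.authors:
        # Semantic criteria combined with an author name.
        return "author_constrained_search"
    # Default: semantic search, with any venue/year constraints applied as filters.
    return "semantic_search"
```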
Each of these sub-flows (which can be thought of as "sub-agents") returns a set of paper IDs, and in sub-flows that include semantic criteria, each paper ID is also associated with a list of evidence and a ranking score reflecting its judged adherence to the semantic criteria. These result sets are then re-ranked according to a formula that combines the semantic relevance with metadata criteria, such as prioritizing more recent and more highly cited papers. The re-ranking is influenced by query modifiers such as "recent", "early", "central" or "classic", which shift the emphasis from the semantic criteria toward the metadata ones.
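As an illustration only (the actual weights and functional form are not published), such a re-ranking could combine the signals along these lines:

```python
import math

def rerank_score(semantic_score: float, year: int, citations: int,
                 modifiers: list[str], current_year: int = 2025) -> float:
    """Blend semantic relevance with recency and citation count (all weights made up)."""
    recency = math.exp(-(current_year - year) / 5.0)                   # newer papers closer to 1
    popularity = min(math.log1p(citations) / math.log1p(10_000), 1.0)  # squash citation counts

    # Default emphasis: mostly semantic relevance.
    w_sem, w_rec, w_pop = 0.8, 0.1, 0.1
    if "recent" in modifiers:                               # modifier shifts weight toward recency
        w_sem, w_rec, w_pop = 0.6, 0.35, 0.05
    elif "central" in modifiers or "classic" in modifiers:  # ... or toward citation count
        w_sem, w_rec, w_pop = 0.6, 0.0, 0.4

    return w_sem * semantic_score + w_rec * recency + w_pop * popularity
```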
Sub-flows: specific papers & semantic search
Finally, let's describe some of the sub-flows themselves, focusing on the two semantic ones: "specific paper" and "semantic search".
The specific-paper sub-flow tries three different strategies in parallel:
- Semantic Scholar title-search API
- Searching for sentences containing the core search terms, focusing on those that also contain a citation and looking for what the majority of them cite (for example, in the query “the alphageometry paper”, it will search for “alphageometry” and see what is cited in its vicinity)
- Asking an LLM for the title and then verifying its existence in the Semantic Scholar title API
It then combines the results into a unified set.
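A rough sketch of this sub-flow, with the three strategies stubbed out as hypothetical async functions (the real calls to the Semantic Scholar APIs and the LLM are not shown):

```python
import asyncio

async def title_search(query: str) -> set[str]:
    """Stub: hit the Semantic Scholar title-search API and return candidate paper ids."""
    return set()

async def citation_vicinity_search(query: str) -> set[str]:
    """Stub: find sentences containing the core terms near citations; return what most cite."""
    return set()

async def llm_guess_then_verify(query: str) -> set[str]:
    """Stub: ask an LLM for the likely title, then verify it exists via the title API."""
    return set()

async def find_specific_paper(query: str) -> set[str]:
    # Run the three strategies concurrently and union their candidates into one set.
    results = await asyncio.gather(
        title_search(query),
        citation_vicinity_search(query),
        llm_guess_then_verify(query),
    )
    return set().union(*results)

# asyncio.run(find_specific_paper("the alphageometry paper"))
```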
The semantic search sub-flow is the most involved. It creates several reformulations of the semantic criteria using an LLM, and each one is sent to three different dense indices, where each index has somewhat different properties and returns paper snippets that can be sentence-length or longer (depending on the index). It also issues a keyword-based query to the Semantic Scholar API. It then looks for citations within the returned snippets, and adds papers that are cited by many snippets to the candidate pool.
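A condensed sketch of that first retrieval round; `llm`, `indices`, and `keyword_api` are hypothetical helper objects standing in for the actual LLM, dense indices, and Semantic Scholar calls, and the citation threshold is made up:

```python
from collections import Counter

def initial_retrieval(criteria, llm, indices, keyword_api,
                      n_reformulations=3, citation_threshold=3):
    """Illustrative first round: reformulate, query each dense index and the keyword API,
    then mine citations from the returned snippets."""
    queries = [criteria] + llm.reformulate(criteria, n=n_reformulations)

    snippets = []
    for q in queries:
        for index in indices:                  # three dense indices with different properties
            snippets.extend(index.search(q))   # each hit carries text, paper id, cited ids

    candidates = {s.paper_id for s in snippets}
    candidates |= set(keyword_api.search(criteria))   # keyword-based Semantic Scholar query

    # Papers that many snippets cite are added to the candidate pool as well.
    cited = Counter(c for s in snippets for c in s.cited_paper_ids)
    candidates |= {pid for pid, n in cited.items() if n >= citation_threshold}
    return candidates, snippets
```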
Then, snippets are grouped according to the papers they came from, and an LLM-based relevance component judges the candidate papers for relevance (see the relevance judgment section below), until sufficiently many highly relevant papers are found, or until a high enough number of candidates has been scanned.
The search process then enters another round, based on the most relevant papers so far. In this round, an LLM reformulates more queries based on the original query and the text of the found relevant papers, and sends them to the above-mentioned indices. Additionally, it does both forward and backward citation tracking based on the most relevant papers, which is again followed by LLM-based relevance judgment. The process continues for several rounds and stops when it either finds enough papers or scans too many candidates.
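Putting the pieces together, the outer loop might look roughly like this. The thresholds, helper objects (`llm`, `indices`, `citation_graph`), and stopping conditions are all illustrative assumptions, and `candidates` is assumed to be a set of paper ids:

```python
def iterative_search(criteria, candidates, llm, indices, citation_graph,
                     max_rounds=3, enough_relevant=30, max_judged=300):
    """Illustrative iteration: judge candidates, expand from the best papers, repeat."""
    judged = {}                                        # paper_id -> relevance score
    for _ in range(max_rounds):
        for pid in set(candidates) - judged.keys():
            judged[pid] = llm.judge_relevance(pid, criteria)             # hypothetical call
        relevant = [pid for pid, score in judged.items() if score >= 2]  # made-up scale
        if len(relevant) >= enough_relevant or len(judged) >= max_judged:
            break                                      # enough found, or too many scanned

        # Expand: reformulate queries grounded in the best papers found so far ...
        for q in llm.reformulate_from_papers(criteria, relevant):
            for index in indices:
                candidates |= {s.paper_id for s in index.search(q)}
        # ... and track citations in both directions from the relevant papers.
        for pid in relevant:
            candidates |= set(citation_graph.references(pid))   # backward citations
            candidates |= set(citation_graph.cited_by(pid))     # forward citations
    return judged
```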
Judging the relevance
In the relevance judgment phase, an LLM goes over all the candidate papers and judges how well they match the semantic criteria in the user's query. For each candidate paper, we format it as markdown text that includes the paper's title, abstract, and the returned snippets from the search, with the snippets ordered by their position in the paper and including the titles of the sections in which they appeared. If the returned snippets include snippets that cite this paper, these snippets are added as well.
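A simplified sketch of that markdown formatting, with assumed field names for the paper and snippet records:

```python
def format_candidate(paper, own_snippets, citing_snippets):
    """Render one candidate paper as markdown for the judging LLM (field names assumed)."""
    lines = [f"# {paper.title}", "", "## Abstract", paper.abstract, "", "## Retrieved snippets"]
    # Snippets from the paper itself, in the order they appear in the paper,
    # each under the title of the section it came from.
    for s in sorted(own_snippets, key=lambda s: s.position_in_paper):
        lines += [f"### {s.section_title}", s.text]
    # Snippets from other papers that cite this candidate, if the search returned any.
    if citing_snippets:
        lines.append("## Snippets citing this paper")
        lines += [s.text for s in citing_snippets]
    return "\n".join(lines)
```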
Developing a reliable method for relevance judgment was challenging. While the system is still evolving, we found a "mini breakthrough" that significantly improved results and usability: ask an LLM to break the user's semantic criteria into multiple semantic sub-criteria, to be verified separately. For example, in the above query about dialog datasets, these sub-criteria would be “introducing an unscripted dialog dataset”, “English language”, “annotated speaker properties” and “relation between dialog and annotation”. Each sub-criterion also includes a brief elaboration. The relevance-judging LLM is then asked to first assess adherence to each individual sub-criterion, and only then combine these into a final score and final judgment.
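Sketched as two hypothetical steps (the actual prompts and the rule for combining the per-criterion scores are not published; `llm.decompose` and `llm.judge` are assumed helpers):

```python
def decompose_criteria(llm, user_criteria):
    """Once per query: break the semantic criteria into named sub-criteria with elaborations,
    e.g. {"English language": "the dataset must consist of English dialogues", ...}."""
    return llm.decompose(user_criteria)

def judge_candidate(llm, candidate_markdown, subcriteria):
    """Per candidate: score adherence to each sub-criterion first, then combine.
    A simple mean is used here purely for illustration."""
    scores = llm.judge(candidate_markdown, subcriteria)   # e.g. {"English language": 1.0, ...}
    return sum(scores.values()) / len(scores)
```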
Batched multi-armed bandits
Relevance judgment is expensive and time-consuming, and we'd like to minimize the number of LLM calls (that is, the number of judged papers). This saves both money and time. To achieve this, we want to order the candidate pool in a way that prioritizes papers that are more likely to be relevant over others. Note that in Ai2 Paper Finder, we get results from several sources: different dense indices and different query formulations, forward citations, backward citations, Semantic Scholar queries, and LLM suggestions. Each of these sources may be internally ranked (e.g., by semantic similarity scores), but different sources work better or worse for different queries (maybe a given query is answered best by citation tracking, while for another query one specific index-reformulation pair is especially effective). How do we efficiently sample from the different sources? We model this as a batched multi-armed bandit problem, and adaptively learn how to sample. We found this to be effective, reducing both the number of judged candidates and response times.
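As one way to picture this (the actual algorithm and parameters are not published), here is a minimal batched Thompson-sampling sketch in which each arm is a candidate source and a reward is a paper judged relevant:

```python
import random

def batched_bandit_judging(sources, judge, batch_size=20, total_budget=200):
    """Illustrative batched Thompson sampling over candidate sources.

    `sources` maps a source name (e.g. "forward_citations" or "index2/reformulation1")
    to its internally ranked, not-yet-judged candidate paper ids; `judge(pid)` returns
    True if the LLM judge finds the paper relevant. Names and numbers are assumptions.
    """
    stats = {name: [1, 1] for name in sources}   # Beta(successes+1, failures+1) per source
    judged = {}
    while len(judged) < total_budget:
        # Select a whole batch using the current (stale) posteriors.
        batch = []
        for _ in range(batch_size):
            live = [name for name, cands in sources.items() if cands]
            if not live:
                break
            name = max(live, key=lambda n: random.betavariate(*stats[n]))
            batch.append((name, sources[name].pop(0)))   # take that source's top candidate
        if not batch:
            break
        # Judge the batch (these calls can run in parallel), then update the posteriors.
        for name, pid in batch:
            relevant = judge(pid)
            judged[pid] = relevant
            stats[name][0 if relevant else 1] += 1
    return judged
```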
Fast mode
As you may imagine, the full process is effective but rather lengthy. There are many LLM calls, and in particular, the relevance judgment step calls an LLM for each judged paper. Although we parallelize and prioritize, the process still takes a long time. The wait could be worthwhile if the results are high quality and hard to find otherwise, but are they really that hard to find otherwise? Sometimes not. In fact, we found that for many of the queries appearing in benchmarks (and also many queries from internal users at Ai2), the full process above is overkill. For this reason, we also introduce a fast mode that does less work: it retrieves fewer papers in the initial stage based on the user's semantic criteria, without additional reformulations and without the follow-up iterative procedure.
This fast mode runs by default, so you don't wait two or three minutes for each response. Based on the results, you can then ask Paper Finder to "work harder", in which case it will invoke the more exhaustive mode described above. You can also invoke the exhaustive mode directly by asking Paper Finder for "an extensive set of papers about X" or something similar in the original query. This way you can get good and (relatively) fast answers to 80% of your queries, while getting higher quality and exhaustiveness for the 20% of queries that require the exhaustive mode.
How well does Ai2 Paper Finder do?
We tested our system on several internal development sets, and have also been dogfooding it for a while by using it internally, first within the Paper Finder team and later across the wider Ai2. We have been using it daily, and we are overall happy with the result quality compared to other tools. This is subjective, though. We are currently compiling a set of challenging queries, which we aim to release as a benchmark.
We also evaluate on LitSearch, a recent academic benchmark from Princeton, focusing on academic literature retrieval, which we found to be of high quality. We evaluate on a somewhat different setup than the paper suggests: rather than searching within the set of 64,183 ACL and ICLR papers provided by LitSearch, we search within the much larger set of several million papers in Semantic Scholar, which is a substantially harder task. Nevertheless, we found Ai2 Paper Finder to perform remarkably well, managing to find perfectly relevant papers for 89% of the queries (and highly relevant papers for 98% of them). These numbers drop to 85% perfectly relevant and 97% highly relevant in fast mode.
Our path ahead
We find Ai2 Paper Finder to be incredibly useful for our day-to-day research, and encourage you to use it, especially on queries that traditional search engines like Semantic Scholar or Google Scholar do not handle well, and when Perplexity gives you fewer results than you’d like. The project itself, however, is just beginning, and there is still a lot to be done. Here is some of what lies ahead:
We didn't invest in metadata (e.g., queries that look not only at the content but also at things like authors, years, publication venues and so on) as much as we'd like, and queries involving metadata are still rough around the edges. We are aware of that and are actively working on improving the results for queries that involve metadata. Handling metadata well is more challenging than one may expect, and we're progressing rapidly (and rolling out improvements to the demo when they are available). Stay tuned for updates on this front.
As for the semantic queries, while we get top results on academic benchmarks such as LitSearch and Pasa, there is still a lot to do. In particular, we’ve already identified several areas which are particularly challenging: queries where the user does not know the right vocabulary, overly long and rambling queries where the user enters a paragraph-length description of their intent, queries that combine multiple semantic criteria where each criterion appears in a different part of the paper, and queries that search for things that are inherently hard to search for using an index (e.g., numeric ranges such as in "training techniques for models with more than 7b parameters", or negated semantic criteria as in "fairness papers that do not discuss race or gender").
Another area we recently started to explore is interactivity and multi-turn interactions. Real-world search is not a one-shot process: once there are results, the searcher may want to refine the query. This refinement may refer to the returned results ("these are great but now focus on memory efficiency" or "the third and fourth are great, can you find more like these?"), and we'd like the follow-up queries to take this into account. This opens up many research, engineering, and UX questions which we are now exploring. Personalization and proactivity from the searching agent are also on our horizon. Today Ai2 Paper Finder does an OK job on various follow-up queries, but we hope the next versions will do a much better job on a much larger range of queries. With the first version released, we are also eager to see your follow-up queries to learn what kinds of interactions you are most interested in.
Finally, the system is now strong but quite rigid, and while it is influenced by LLM decisions, the flows are predominantly shaped by the researchers and engineers in our team. This is powerful and effective but also limiting (as an almost trivial example, a query like "the bert paper and the roberta paper" is currently not handled well, and could be trivially supported by a more dynamic, LLM-controlled flow). Going forward, we'd like to see more and more decisions delegated to the LLM, supporting more dynamic and ad-hoc flows.
Ai2 Paper Finder is part of a larger vision that we have here at Ai2, one of an agentic scientific research assistant. We aim to help advance science by supporting all research needs, from paper finding, literature organization and understanding, through experiment design, statistical analysis, and experiment execution. More to come!
Credits to everyone who worked on this project: Dan Bareket, Aryeh Tiktinsky, Micah Shlain, Mark Polak, Ben Eyal, Menny Pinhasov, Sigal Rahamimov, Uri Katz, Guy Wiener, Yoav Goldberg, Chloe Anastasiades, Matt Latzke, Smita Rao, Cecile Nguyen