Building Elite Retrieval Systems for Research
State-of-the-art retrieval systems are crucial for navigating the ever-growing body of research literature. Here's how we built one for Open Paper.
Rest assured that RAG is not yet dead. Retrieval-augmented generation (RAG) refers to systems that combine a static knowledge base with a generative model to produce contextually relevant outputs. Two years ago, a lot of RAG systems relied on vector databases and used semantic search to pull in related context, passing it to an LLM for final inference. But I think a lot of application developers (including our team) felt the limitations of this design.
Maintaining an embeddings layer can be fairly costly: there's the inference cost of generating the embeddings, and then the storage cost of keeping them, which may not be worth it for the many items that have very low read rates. Whenever you want to change your embeddings model, your entire corpus needs to be re-embedded. Plus, it leaves open the problem of reconstructing the source text: an embedding can match a chunk, but that chunk may be part of a larger paragraph or chapter that can only be understood in the context of the surrounding text. This makes it difficult to build good systems for answering exploratory sensemaking questions. We learned this the hard way while building khoj.
I noticed sometime last year that coding assistants were starting to move towards more greedy implementations. Rather than using expensive, pre-computed embeddings on the data itself, LLMs have been sufficiently trained to leverage search tools (like grep) to look for semantically similar words or expressions in the corpus. This is surprisingly robust.
When it comes to knowledge bases in Open Paper, sizes can get fairly large. Currently, we support up to 500 papers and 3GB. State-of-the-art models can ingest up to 1M tokens, but 500 papers, assuming each has about 15 pages and 6,000 words, would total 3M+ tokens. Models still can't hold that much context in their attention windows, so some context engineering is necessary to package information efficiently for consumption. Plus, we know that just because a model can take in 1M tokens doesn't mean it should: models still perform better when given targeted information.
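As a back-of-the-envelope check (the per-paper figures are the assumptions above; the words-per-token ratio is a common English rule of thumb, not a measured value):

```python
# Rough context-budget math for a full 500-paper library.
PAPERS = 500
WORDS_PER_PAPER = 6_000   # ~15 pages per paper, as assumed above
WORDS_PER_TOKEN = 0.75    # common rule of thumb for English text

total_words = PAPERS * WORDS_PER_PAPER
total_tokens = int(total_words / WORDS_PER_TOKEN)

print(total_words)   # 3,000,000 words
print(total_tokens)  # ~4,000,000 tokens: well past a 1M-token window
```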
To manage this effectively, we store an index of all your papers in a Postgres DB. On initial upload, we extract all the associated text, the authors, the abstract, and the institutions.
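A minimal sketch of what that index might look like. Open Paper uses Postgres; sqlite is used here only so the example is self-contained, and the column names are illustrative rather than the actual schema:

```python
import sqlite3

# Illustrative paper index: one row per uploaded paper, with the
# metadata extracted on initial upload.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE papers (
        id           INTEGER PRIMARY KEY,
        title        TEXT NOT NULL,
        abstract     TEXT,
        authors      TEXT,   -- e.g. a JSON-encoded list
        institutions TEXT,
        full_text    TEXT    -- extracted text of the whole paper
    )
""")
conn.execute(
    "INSERT INTO papers (title, abstract, authors, institutions, full_text) "
    "VALUES (?, ?, ?, ?, ?)",
    ("Attention Is All You Need", "We propose the Transformer...",
     '["Vaswani et al."]', "Google Brain", "..."),
)
titles = [row[0] for row in conn.execute("SELECT title FROM papers")]
print(titles)  # ['Attention Is All You Need']
```

Keeping only extracted text and metadata, with no embeddings column, is what makes the storage side of this design cheap.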
When you ask a question in the multi-paper scenarios (either projects or the /understand page), the model is given access to the titles of all the works in your project/library. From there it can pick the best tool to dig into the works to answer your question:
- read_abstract: Reads the abstract of a single file.
- search_knowledge_base: Enters a search term to search across the entire set of reference works.
- search_file: Enters a search term to search within a specific work.
- view_file: Inspects a specific file based on a range of line numbers.
- read_file: Reads a full file.
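These five tools can be handed to the model as function-calling schemas. The parameter names below are hypothetical, not the actual definitions:

```python
# Hypothetical function-calling schemas for the five retrieval tools.
# Parameter names are illustrative; the real definitions may differ.
TOOLS = [
    {"name": "read_abstract",
     "description": "Read the abstract of a single paper.",
     "parameters": {"paper_id": "string"}},
    {"name": "search_knowledge_base",
     "description": "Search a term across the entire set of reference works.",
     "parameters": {"query": "string"}},
    {"name": "search_file",
     "description": "Search a term within a specific work.",
     "parameters": {"paper_id": "string", "query": "string"}},
    {"name": "view_file",
     "description": "Inspect a file by a range of line numbers.",
     "parameters": {"paper_id": "string", "start_line": "int", "end_line": "int"}},
    {"name": "read_file",
     "description": "Read a full file.",
     "parameters": {"paper_id": "string"}},
]

tool_names = [t["name"] for t in TOOLS]
print(tool_names)
```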
You can watch OP undertake its investigation in the thinking trace while it's working on your question.
In a perfect world, our models would read_file their way through the whole knowledge base, but we know that they can't do that due to limits in the aforementioned context window. Hence, they use all these other tools at their disposal to pinpoint an answer to your question. They'll typically start with a call to search_knowledge_base, and use that as a starting point to dig in deeper.
Both of the search functions return line numbers in their results, so they can be used in conjunction with view_file, which lets the model request a specific range of lines. The model is given a few iterations to retrieve evidence for answering the question before it has to finalize and respond.
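A minimal sketch of that bounded loop, with stubbed-out tools (the real implementations query the paper index, and the real agent lets the LLM decide which tool to call next):

```python
# Sketch of the bounded evidence-gathering loop: the worker model gets a
# fixed number of tool-call iterations before it must finalize.
MAX_ITERATIONS = 5

def search_knowledge_base(query):
    # Stub: returns (paper_id, line_number, snippet) hits.
    return [("paper-42", 118, "...transformers rely on self-attention...")]

def view_file(paper_id, start_line, end_line):
    # Stub: returns the requested line range of the paper's text.
    return f"[{paper_id} lines {start_line}-{end_line}]"

def gather_evidence(question):
    evidence = []
    for _ in range(MAX_ITERATIONS):
        hits = search_knowledge_base(question)
        if not hits:
            break
        paper_id, line, snippet = hits[0]
        # Line numbers from search feed directly into view_file,
        # pulling in surrounding context around each hit.
        evidence.append(view_file(paper_id, line - 5, line + 5))
        break  # a real agent would let the LLM decide when to stop
    return evidence

print(gather_evidence("how does self-attention work?"))
# -> ['[paper-42 lines 113-123]']
```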
The tool orchestration is done (as of writing this in Spring 2026) by a model called kimi-k2-thinking (see the github). This model is downright excellent at tool calling, because it was specifically RL-ed on a variety of tool execution scenarios that helped it improve this skill. It's quick and accurate with function calling. We've also been testing Zai's GLM model, which is also very effective.
These models, despite being great at tool selection, are not exactly smart enough for the final response - for that, we use Gemini 3. This model is robust for giving well-informed, accurate, grounded answers. We have a fairly gnarly citation protocol that the agent has to adhere to while it's streaming a response, and a lot of models botch this. Importantly, this more intelligent model also comes with a high latency cost.
```
+----------------+      +-------------------------------------------------+      +-------------------+
|      User      |----->|                     Server                      |----->|        LLM        |
+----------------+      |           (multi_paper_operations.py)           |      +-------------------+
        ^               |                                                 |                ^
        |               |  1. gather_evidence(question)                   |                |
        |               |     - Iteratively calls LLM with tools:         |                |
        |               |       - search_all_files(query)                 |--------------+
        |               |       - read_file(paper_id, query)              |
        |               |       - ...                                     |
        |               |     - Compacts evidence if it gets too large    |
        |               |                                                 |
        |               |  2. chat_with_papers(question, evidence)        |
        |               |     - Sends evidence and question to LLM        |--------------+
        |               |     - Streams response back to user             |                |
        |               |     - Parses citations from response            |                |
        |               +-------------------------------------------------+                |
        |                                       |                                          |
        +---------------------------------------+-----------------------------------------+
                        (Streamed response with citations)
```
This is the happy path: the worker model collects evidence, hands off to the smarter model, and a final response is constructed. But often, toward the end of that evidence-collection stage, we accumulate a lot of cruft in the collected evidence. We end up nearly back at square one, where the data we have exceeds the limits of our model's context window.
To mitigate this, we've implemented compaction: a compression stage in which the LLM reviews the evidence it has collected for each paper, relative to the target question, and summarizes it.
So, in theory, we can reduce the size of the evidence from 1,000 lines to 10, depending on how the compaction goes. Abstraction loses accuracy, so the LLM is forced to generate citations for the compaction step itself; that way, we know which snippets were actually used to produce its answer. The final-response agent then takes in the compacted evidence, rather than a huge dump that it would not be able to process.
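A toy sketch of the compaction step. Here `summarize()` stands in for the LLM summarization call, and the `[C1]`-style citation markers are an illustrative convention, not Open Paper's actual citation protocol:

```python
# Sketch of compaction: collected snippets are assigned citation IDs,
# then compressed into a short summary that must cite those IDs.
def summarize(snippets):
    # Placeholder for the LLM summarization call.
    return "Self-attention scales quadratically with sequence length."

def compact(evidence):
    citations = {f"C{i}": snippet for i, snippet in enumerate(evidence, 1)}
    summary = summarize(list(citations.values()))
    # The model is required to cite the snippets it compressed.
    cited = summary + " " + "".join(f"[{cid}]" for cid in citations)
    return cited, citations

evidence = [
    "Attention is computed over all token pairs...",
    "Memory grows as O(n^2) in sequence length...",
]
compacted, citation_map = compact(evidence)
print(compacted)
# -> "Self-attention scales quadratically with sequence length. [C1][C2]"
```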
But now we've lost all of our granular citations.
To reconstruct the citations for the final answer, we use the references in the compacted summary to expand back out to the relevant snippets that powered it. This lets us zoom in and out through the relevant steps of our work, while giving the model the minimal relevant context it needs to put together a more useful response to the target question.
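Continuing the toy citation convention from above, the expansion step is essentially a lookup from citation IDs back to their source snippets:

```python
import re

# Sketch of citation reconstruction: citation IDs embedded in the
# compacted summary are expanded back into the original snippets.
def expand_citations(summary, citation_map):
    ids = re.findall(r"\[(C\d+)\]", summary)
    return [citation_map[cid] for cid in ids]

citation_map = {
    "C1": "Attention is computed over all token pairs...",
    "C2": "Memory grows as O(n^2) in sequence length...",
}
summary = "Self-attention scales quadratically. [C1][C2]"
print(expand_citations(summary, citation_map))
# -> ['Attention is computed over all token pairs...',
#     'Memory grows as O(n^2) in sequence length...']
```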
Vector embeddings can still be powerful because they make your read-time compute very, very cheap. But they require a lot of manual reconstruction. Our search_knowledge_base step could be augmented with a cosine similarity layer, but it's not strictly necessary, nor always beneficial.
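If we did add that layer, the core read-time operation is just cosine similarity between a query vector and pre-computed chunk vectors. A toy illustration (real systems would use an embedding model and a vector index, not 3-dimensional made-up vectors):

```python
import math

# Toy cosine-similarity ranking over pre-computed chunk embeddings.
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

chunks = {
    "chunk-1": [0.9, 0.1, 0.0],
    "chunk-2": [0.1, 0.8, 0.3],
}
query = [1.0, 0.0, 0.0]

# Rank chunks by similarity to the query vector.
best = max(chunks, key=lambda c: cosine(query, chunks[c]))
print(best)  # chunk-1
```

The appeal is that this ranking is very cheap at read time; the cost, as discussed above, lives in computing and storing the vectors up front.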
The whole process of generating accurate, grounded responses is still evolving. And honestly, it's still imperfect. The AI system still makes mistakes and requires manual intervention. That's why we've taken a design approach that makes it easy for you to jump back into context when needed, with minimal loss to flow.
The broader lesson we've taken away from building this is that retrieval doesn't have to be a solved-once, static layer. By letting the model actively search and navigate your knowledge base — rather than relying on a precomputed index to do the matching for it — you get a system that's more adaptable and, in practice, more accurate. The tradeoff is latency and compute at read time, but as models get faster and cheaper, that tradeoff keeps getting better. We're betting that the future of retrieval looks less like a database lookup and more like a dynamic research assistant that knows how to dig.