You shipped a RAG chatbot last quarter. Embeddings, vector database, prompt template, the lot. Demo went great. Three months in, the support team is finding answers that are technically in the corpus but consistently the wrong ones – close enough on the embedding to rank highly, but not actually what the question was asking. You crank the top-k from 5 to 20, the LLM gets confused by the noise, and the answers get worse. You’re stuck.
The fix is a step you skipped.
In To LLMs… and Beyond! we covered RAG – retrieval-augmented generation – as a two-step pattern: retrieve relevant documents (embed the query, find its nearest neighbours), then generate the answer. That’s the correct shape for explanation. It’s also the wrong shape for production. Most working RAG systems have three steps, and the missing middle one is where the quality lives.
This post is about that middle step.
Why a single retrieval pass isn’t enough
The retrieval step in RAG uses what’s called a bi-encoder: an encoder model (usually BERT-family, see The Other Transformers) that produces a single vector for each piece of text. The query gets one vector. Each document gets one vector. You compare them by cosine similarity – the closer the angle, the more similar the texts.
This is fast. Embarrassingly fast. You can pre-compute the document vectors once and store them in a database. At query time, you only need to embed the query (a few milliseconds) and find the nearest neighbours (a few more milliseconds, even across millions of documents). It scales to web-search levels.
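In code, the whole retrieval step is only a few lines. A minimal sketch, using the sentence-transformers library – the model name and the toy corpus are illustrative picks, any bi-encoder behaves the same way:

```python
from sentence_transformers import SentenceTransformer, util

bi_encoder = SentenceTransformer("all-MiniLM-L6-v2")

documents = [
    "Refund policy: physical goods may be returned within 30 days.",
    "Digital downloads are non-refundable once purchased.",
    "Our company values: transparency, curiosity, and customer focus.",
]

# Pre-compute the document vectors once; in production these live in a vector DB.
doc_embeddings = bi_encoder.encode(documents, convert_to_tensor=True)

# At query time, only the query needs embedding, then a nearest-neighbour lookup.
query = "can I get a refund on a digital download after 30 days?"
query_embedding = bi_encoder.encode(query, convert_to_tensor=True)

hits = util.semantic_search(query_embedding, doc_embeddings, top_k=2)[0]
for hit in hits:
    print(f"{hit['score']:.3f}  {documents[hit['corpus_id']]}")
```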
It’s also kind of dumb.
The bi-encoder embeds the query and the document independently. The model never sees them together. It produces a vector for the query that captures the query’s meaning in general, and a vector for the document that captures the document’s meaning in general, and then you compare those two general representations. There’s no opportunity for the model to notice that this specific query is asking about a specific aspect of this specific document.
In practice this means bi-encoders are good at finding documents that are topically related to the query. They’re less good at finding the documents that actually answer the query. Two documents about the same topic can have very similar embeddings even if only one of them contains the answer.
For a vague question like “what’s our refund policy?” topical similarity is enough. For a specific question like “can I get a refund on a digital download after 30 days if I haven’t used it?” you need a model that can read the query and the candidate documents together and decide which one actually addresses the conditions.
That’s a cross-encoder.
What a cross-encoder is
A cross-encoder is the same architecture (an encoder transformer) used a different way. Instead of producing a vector for each text, it takes a pair of texts – query and candidate document – and produces a single relevance score.
The query and document get concatenated with a separator token, fed through the model together, and the model’s full attention mechanism gets to see every query token attend to every document token and vice versa. The output is one number: how well does this document answer this query?
```
[CLS] can I get a refund on a digital download after 30 days [SEP]
Refund policy: physical goods may be returned within 30 days. Digital
downloads are non-refundable once purchased. [SEP]
```
The model reads that and outputs, say, 0.91 – the document is highly relevant because it directly addresses both “digital download” and “refund,” even though the answer is “no.” A different document that only mentions the 30-day window for physical goods might score 0.34.
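A minimal sketch of the same idea, using one small open reranker – note that this particular model emits raw logits rather than tidy 0–1 probabilities, but higher still means more relevant:

```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "can I get a refund on a digital download after 30 days"
candidates = [
    "Refund policy: physical goods may be returned within 30 days. "
    "Digital downloads are non-refundable once purchased.",
    "Returns for physical goods are accepted within a 30-day window.",
]

# Every (query, document) pair is a fresh forward pass through the model.
scores = reranker.predict([(query, doc) for doc in candidates])

# Higher score = more relevant; this model outputs raw logits, not probabilities.
for score, doc in sorted(zip(scores, candidates), reverse=True):
    print(f"{score:+.2f}  {doc[:60]}")
```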
Cross-encoders are dramatically more accurate than bi-encoders for relevance. They’re also dramatically slower. Because the model has to see the query and document together, you can’t pre-compute anything – every query against every candidate is a fresh forward pass. If you have a million documents and run the cross-encoder against all of them, you’d be measuring query latency in hours, not milliseconds.
Which is why you don’t do that. You do retrieve-then-rerank.
The two-stage pattern
The standard production RAG pipeline is:
- Retrieval (bi-encoder). Embed the query, find the top 50-200 candidate documents from the vector database. Fast, parallel, scalable.
- Reranking (cross-encoder). Score each of those candidates against the query using a cross-encoder. Pick the top 3-10 by score.
- Generation (LLM). Pass the top reranked documents into the LLM along with the query. Generate the answer.
The retrieval stage is “we cast a wide net, fast.” The reranking stage is “we read each catch carefully, slowly, but only the ones in the net.” Together they let you get cross-encoder-quality relevance at bi-encoder-scale corpus sizes.
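Wired together, the whole pattern fits in a short function. A sketch, with illustrative model names and a toy corpus standing in for your vector database:

```python
from sentence_transformers import SentenceTransformer, CrossEncoder, util

bi_encoder = SentenceTransformer("all-MiniLM-L6-v2")
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

documents = [
    "Refund policy: physical goods may be returned within 30 days.",
    "Digital downloads are non-refundable once purchased.",
    "Enterprise tier SLA: 99.9% uptime, four-hour response on P1 incidents.",
]  # in production: your corpus, embedded once and stored in a vector DB
doc_embeddings = bi_encoder.encode(documents, convert_to_tensor=True)

def retrieve_then_rerank(query: str, retrieve_k: int = 100, final_k: int = 5):
    # Stage 1: cast a wide net with the bi-encoder (fast, approximate).
    q_emb = bi_encoder.encode(query, convert_to_tensor=True)
    hits = util.semantic_search(q_emb, doc_embeddings, top_k=retrieve_k)[0]
    candidates = [documents[hit["corpus_id"]] for hit in hits]

    # Stage 2: read each catch carefully with the cross-encoder (slow, precise).
    scores = reranker.predict([(query, doc) for doc in candidates])
    reranked = sorted(zip(scores, candidates), reverse=True)
    return [doc for _, doc in reranked[:final_k]]

print(retrieve_then_rerank("what's the SLA for our enterprise tier?"))
```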
The numbers are striking. For a corpus of one million documents:
- Bi-encoder only: ~10ms per query, mediocre relevance.
- Cross-encoder only: ~1,000,000 model calls per query. Untenable.
- Bi-encoder + cross-encoder: ~10ms retrieval + ~200ms reranking on 100 candidates = ~210ms total, with relevance approaching cross-encoder-only quality.
That third option is what every serious RAG system is doing. The blog posts that don’t mention it are showing you the demo, not the production system.
Models you can actually use
Reranker models are a small but mature corner of the ecosystem, with solid options both open-weight and behind an API.
| Model | Made by | Open / closed | Notable for |
|---|---|---|---|
| BGE Reranker (v2-m3, large) | BAAI | Open | Strong default, multilingual, well-supported |
| Cohere Rerank | Cohere | Closed (API) | Easy integration, multilingual, pay-per-call |
| Voyage Rerank | Voyage AI | Closed (API) | High quality, instruction-tuned variants |
| ms-marco-MiniLM-L-6-v2 | sentence-transformers | Open | Tiny (22M params), runs on CPU, fine for English |
| Jina Reranker | Jina AI | Open / API | Long-context variants for document-level reranking |
The lightweight ones (the MiniLM cross-encoders, around 20-100M parameters) run on a CPU. The heavyweight ones (BGE Reranker v2-m3, around 568M parameters) want a GPU but produce noticeably better rankings. For most projects the correct starting point is the smallest open model that fits your latency budget; you can swap up if quality demands it.
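Because most open rerankers expose the same cross-encoder interface, swapping up is usually a one-line change. A hedged sketch – the commented-out line assumes the BGE model loads as a standard sequence-classification cross-encoder, which is how it’s published:

```python
from sentence_transformers import CrossEncoder

# Start small: ~22M parameters, happily runs on a CPU, English-focused.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

# Swap up if quality demands it: ~568M parameters, multilingual, wants a GPU.
# reranker = CrossEncoder("BAAI/bge-reranker-v2-m3")

score = reranker.predict([
    ("what's the SLA for our enterprise tier?",
     "Enterprise tier SLA: 99.9% uptime, four-hour response on P1 incidents."),
])
print(score)
```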
When reranking earns its keep
Not every retrieval task needs a reranker. The benefit grows with task difficulty:
- Vague topical queries against a small corpus: bi-encoder is fine. “Tell me about our company values” against a 50-document handbook will return the correct document on cosine similarity alone.
- Specific factual queries against a medium corpus: reranker helps. “What’s the SLA for our enterprise tier?” against a thousand-document knowledge base benefits from the cross-encoder noticing that the document mentioning enterprise tier SLAs specifically is more relevant than the one with the same words in a marketing context.
- Long-tail queries against a large corpus: reranker is essential. Web-scale search, code search, scientific literature search – the bi-encoder will return a heap of plausible-but-not-quite candidates, and the reranker is what separates them.
The pattern: bi-encoders fail by returning plausibly-related but not actually-answering documents. If your eval set is full of cases like that, you need a reranker. If your bi-encoder is missing the correct document entirely (it’s not in the top 200), reranking won’t save you – you need better embeddings or a hybrid retrieval strategy. Different problem.
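A quick way to tell which problem you have, given an eval set of (query, gold document) pairs. Everything here is a placeholder – the retrieve function stands in for your own search, and the 0.9 / 0.6 thresholds are illustrative, not industry standards:

```python
# `eval_set` is a list of (query, gold_doc_id) pairs; `retrieve(query, k)` is
# whatever returns your ranked doc ids (bi-encoder top-k, hybrid, ...).
def recall_at_k(eval_set, retrieve, k: int) -> float:
    hits = sum(gold_id in retrieve(query, k) for query, gold_id in eval_set)
    return hits / len(eval_set)

def diagnose(eval_set, retrieve) -> str:
    if recall_at_k(eval_set, retrieve, 200) < 0.9:
        return "Gold document often missing from candidates: better embeddings or hybrid retrieval."
    if recall_at_k(eval_set, retrieve, 5) < 0.6:
        return "Candidates are there but badly ranked: add a reranker."
    return "Retrieval looks healthy - look at chunking, prompting, or generation instead."
```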
Hybrid retrieval: the other thing you might be missing
While we’re here, the second-most-skipped step in RAG explanations: hybrid retrieval.
Bi-encoders work on semantic meaning. They’re great at handling paraphrase (“how do I cancel?” finds documents about “subscription termination”). They’re weak at exact matches – product codes, person names, error messages, version numbers. The vector for KB-ERR-2847-fatal doesn’t necessarily live near the vector for 2847 in embedding space, because the model has never seen that specific string and treats it as a sequence of arbitrary subword tokens.
Hybrid retrieval combines a semantic search (bi-encoder, dense vectors) with a lexical search (BM25, sparse keyword matching) and merges the results. The semantic search catches paraphrase. The lexical search catches exact matches. The reranker takes the union and sorts it.
In production:
- Semantic retrieval returns top 100 by embedding similarity.
- Lexical retrieval returns top 100 by BM25 score.
- Merge – take the union (often 150-200 documents after dedup).
- Rerank with a cross-encoder, take the top 5-10.
- Generate with the LLM.
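A minimal sketch of that pipeline, using rank_bm25 for the lexical leg and sentence-transformers for the rest – models, whitespace tokenisation, and the toy corpus are all illustrative:

```python
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer, CrossEncoder, util

documents = [
    "Refund policy: physical goods may be returned within 30 days.",
    "Error KB-ERR-2847-fatal: raised when the licence server is unreachable.",
    "Digital downloads are non-refundable once purchased.",
]

bi_encoder = SentenceTransformer("all-MiniLM-L6-v2")
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

doc_embeddings = bi_encoder.encode(documents, convert_to_tensor=True)
bm25 = BM25Okapi([doc.lower().split() for doc in documents])

def hybrid_search(query: str, retrieve_k: int = 100, final_k: int = 5):
    # Semantic leg: catches paraphrase.
    q_emb = bi_encoder.encode(query, convert_to_tensor=True)
    dense_ids = {hit["corpus_id"]
                 for hit in util.semantic_search(q_emb, doc_embeddings, top_k=retrieve_k)[0]}

    # Lexical leg: catches exact strings like error codes and IDs.
    bm25_scores = bm25.get_scores(query.lower().split())
    lexical_ids = set(sorted(range(len(documents)),
                             key=lambda i: bm25_scores[i], reverse=True)[:retrieve_k])

    # Merge the union, then let the cross-encoder sort it out.
    candidates = [documents[i] for i in dense_ids | lexical_ids]
    scores = reranker.predict([(query, doc) for doc in candidates])
    reranked = sorted(zip(scores, candidates), reverse=True)
    return [doc for _, doc in reranked[:final_k]]

print(hybrid_search("what does KB-ERR-2847-fatal mean?"))
```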
This pattern – often called hybrid retrieval with cross-encoder reranking – is the realistic shape of a production RAG system in 2026. The blog-post version with one embedding lookup is the simplification.
A decision table
| Symptom | Likely fix |
|---|---|
| "The correct document is in the top 50 but not the top 5" | Add a reranker |
| "The correct document isn't in the top 50 at all" | Better embeddings, or hybrid retrieval (BM25 + semantic), or chunk differently |
| "It can't find specific product codes / IDs" | Hybrid retrieval -- you need lexical matching |
| "The LLM is confused by too many candidates" | Lower top-k after reranking; trust the reranker to filter |
| "Latency is too high" | Smaller reranker (MiniLM cross-encoders), or fewer candidates into the reranker |
| "Quality varies wildly between users" | Likely a chunking or query-rewriting issue, not a reranker issue |
What’s worth remembering
- Bi-encoders embed independently. The model never sees the query and document together. They’re fast but rough.
- Cross-encoders read together. The model attends across both, producing far better relevance scores – at much higher cost.
- The two-stage pattern is the standard. Retrieve broadly with bi-encoder, rerank precisely with cross-encoder.
- Reranking earns its keep when “topically related” isn’t the same as “actually answers.”
- Hybrid retrieval (BM25 + semantic) is the other thing the blog posts skip. Especially important for codes, IDs, names.
- Pick the smallest reranker that meets your latency budget. MiniLM cross-encoders run on CPU and are often enough.
- If the correct document isn’t in the candidate set, no reranker can save you. Different problem, different fix.
The next post in this series leaves the LLM neighbourhood entirely. Transformers aren’t the only architecture in town – they aren’t even the only architecture good at sequences. Mamba, RWKV, RetNet, diffusion-for-text, and the rest of the post-transformer landscape are starting to matter.
The next chapter, After the Transformer, publishes around 9 May.