By Claude and Gemini with Sid Newby | April 2026
Run two queries against the same four-million-document review. The first: severance package non-compete. The second: separation agreement restrictive covenant. A reasonable lawyer would expect significant overlap in the results. In 2026, you will probably get two different document sets. Which one is more accurate depends on which retrieval engine the platform used. It depends on how that engine merged candidates from two scoring systems. And it depends on whether a reranker got the final say.
The phrase "AI search" appears in every eDiscovery vendor pitch deck this year. Almost no buyer can sketch the architecture that sits behind that phrase. The asymmetry is structural. Vendors sell capability and abstract it away. Buyers approve the cost without seeing the machinery. And when a privileged document gets missed or a hot document gets buried, nobody can say which layer of the stack failed.
This post is the missing diagram. It walks through the actual retrieval pipeline running underneath every modern review platform. BM25. Dense vector embeddings. Hybrid fusion. Cross-encoder reranking. The chunking decision that quietly determines whether any of it works. None of this is proprietary magic. It is published computer science from the information retrieval and NLP communities, repackaged for litigation.
If you are responsible for buying, defending, or producing from a review platform in 2026, you should be able to read this and ask better questions on your next vendor call.
What Search Used to Be in eDiscovery
Until roughly 2018, search in eDiscovery meant two things: Boolean keyword expressions, and concept search built on a technique called latent semantic indexing.
Boolean keyword search is exact-match string lookup. (severance OR separation) AND (non-compete OR "restrictive covenant") w/15 termination — the syntax is grim and the results are exactly what the syntax says. Spelling matters. Hyphens matter. A document with "noncompete" instead of "non-compete" gets dropped. Synonyms outside the query do not exist to the system.
Concept search added a layer of statistical co-occurrence math. The idea: which words tend to appear in the same documents, and therefore probably mean related things. Practitioners called this technology-assisted review when used to score documents for relevance. Magistrate Judge Andrew Peck famously blessed the practice in Da Silva Moore v. Publicis Groupe in 2012.[1] That ruling was the first published federal endorsement that an algorithmic approach could replace linear keyword review. It also set the validation playbook the industry has used since: train the classifier on a seed set, run it across the population, sample, measure recall and precision, document everything for defensibility.
That playbook works for the math underneath it. The math underneath modern AI review is different.

Figure 1: How the retrieval stack underneath eDiscovery review platforms evolved from Boolean and concept search to modern hybrid retrieval.
The shift that matters happened around 2018. Dense vector embeddings, powered by transformer models like BERT, became cheap enough to run at production scale. By 2024 they were standard in commercial RAG and search products. By 2026 every major review platform — Relativity, Everlaw, Reveal, DISCO, HaystackID — runs some flavor of hybrid retrieval underneath its review interface. Most never explain it on the marketing site.
Relativity made its aiR for Review and aiR for Privilege generative AI products free inside RelativityOne in early 2026.[2] Everlaw matched by including its EverlawAI Assistant features at no charge. DISCO bundled Cecilia into its single per-GB price.[3] When the marketing copy says "AI-powered search," what it usually means is a hybrid retrieval pipeline feeding an LLM. Each of those words is a separate engineering decision. Each decision has a way it can break.
BM25, Mechanically
BM25 is the modern descendant of TF-IDF, the term-weighting formula that ran search engines for thirty years. It is a sparse retrieval algorithm — sparse meaning that documents and queries are represented as long vectors mostly full of zeros, with a non-zero score only for the specific terms that appear.
The scoring formula is unsentimental. For each query term, BM25 multiplies three things together. First, how often the term appears in the document (term frequency). Second, how rare the term is across the whole collection (inverse document frequency). Third, a length normalization factor that prevents long documents from dominating just because they contain everything.
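The per-term score is compact enough to write out. A minimal sketch in Python, using the common Okapi defaults for k1 and b; the constants, like everything here, are illustrative rather than any platform's actual configuration:

```python
import math

def bm25_term_score(tf, doc_len, avg_doc_len, n_docs, n_docs_with_term,
                    k1=1.2, b=0.75):
    """Okapi BM25 score for one query term against one document."""
    # Inverse document frequency: rare terms are worth more.
    idf = math.log(1 + (n_docs - n_docs_with_term + 0.5) / (n_docs_with_term + 0.5))
    # Term frequency, saturated by k1 and length-normalized by b.
    norm_tf = (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))
    return idf * norm_tf
```

A document's full-query score is this value summed over the query terms, which is why a single very rare term can dominate.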
What this means in practice:
- A term that is rare in the corpus and frequent in a specific document scores very high. Bates number BAT00457893 is rare in the corpus and frequent in the document containing it, so BM25 will surface it instantly.
- A term that is common in the corpus — agreement, the, party — scores near zero regardless of how often it appears in any given document.
- Documents that contain all the query terms score higher than documents that contain only some of them, but a single rare term can still dominate.
BM25 is Elasticsearch's default ranker. It is what every "keyword search" feature in your review platform actually runs. It gets credit when you find a hot document by typing in a known phrase. It also fails predictably when the hot document phrases the concept differently than the query did.
Take the non-compete query from the opening example. BM25 will not return the document that says the Employee shall not, for a period of twenty-four months following Separation Date, engage in or be associated with any business that competes directly or indirectly. The semantic content is identical. The lexical overlap is approximately zero. BM25 does not see it.
Recent benchmarks put pure BM25 retrieval at roughly 65% recall@10 on standard evaluations.[4] Translation: if you only run BM25, about a third of the relevant documents that belong in your top ten never make it there. The documents you miss are the ones that don't share vocabulary with your query.
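Since recall@10 is the yardstick for the rest of this post, here is the metric itself, with hypothetical document IDs:

```python
def recall_at_k(retrieved, relevant, k=10):
    """Fraction of the relevant documents that appear in the top-k results."""
    return len(set(retrieved[:k]) & set(relevant)) / len(relevant)

# 2 of the 3 relevant documents made the top ten: recall@10 is about 0.67.
print(recall_at_k(["d1", "d7", "d2", "d4", "d5"], relevant=["d1", "d2", "d9"]))
```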
Vector Embeddings, Mechanically
A dense embedding is a learned representation. You take a transformer model — modern systems often use a fine-tuned model like nomic-ai/modernbert-embed-base or commercial offerings from OpenAI and Cohere — and feed it a chunk of text. The model returns a fixed-length list of floating-point numbers. Typical lengths run between 384 and 1,536 dimensions. That list is the embedding.
The model has been trained on billions of text pairs. The training goal: produce embeddings such that semantically similar text ends up near each other in this high-dimensional space. Severance package and separation agreement land close together. Severance package and baking instructions for shortbread land far apart. Distance is measured by cosine similarity or dot product. Geometry, not string matching.
To run a search, you embed the query the same way. Then ask the vector database which document embeddings are nearest to the query embedding. Modern vector databases — pgvector, Pinecone, Qdrant, Weaviate, Milvus — use approximate nearest neighbor algorithms to do this in milliseconds across millions of documents. HNSW is the most common.
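The round trip is short. A sketch using the open-source sentence-transformers library and a small public model chosen purely for illustration (no review platform ships this exact model); a production system would replace the brute-force cosine comparison below with an ANN index such as HNSW:

```python
from sentence_transformers import SentenceTransformer, util

# Small public bi-encoder, illustrative only.
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

chunks = [
    "the Employee shall not, for a period of twenty-four months following "
    "Separation Date, engage in any business that competes directly or indirectly",
    "baking instructions for shortbread",
]

query_vec = model.encode("non-compete")     # one fixed-length vector
chunk_vecs = model.encode(chunks)           # one vector per chunk
print(util.cos_sim(query_vec, chunk_vecs))  # geometry, not string matching
```

The clause outscores the recipe even though neither chunk contains the literal string non-compete.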
The capability this unlocks is real. The system can return the Employee shall not, for a period of twenty-four months in response to non-compete. The embedding model has learned that those phrases live in the same neighborhood of meaning. Concept search from 2008 tried to do this with co-occurrence statistics. Embeddings do it with learned representations that capture meaning in a way co-occurrence statistics never could.
Pure dense retrieval reaches about 78% recall@10 on the same benchmarks where BM25 hits 65%.[4] Better, but not great. The failure modes are specific:
- Identifier searches break. Bates number BAT00457893 gets embedded as a generic alphanumeric blob. The vector model has no special knowledge that this is a document identifier; it just looks vaguely like other identifier strings. The exact-match document may not even be in the top fifty.
- Negation gets ignored. Documents discussing the agreement but not signed by Robertson embeds almost identically to documents discussing the agreement signed by Robertson. Embedding models capture topical similarity, not logical structure.
- Out-of-domain terms hallucinate similarity. A model trained on general web text may not distinguish between equitable estoppel and equitable distribution. Both involve the word equitable. Both come from legal contexts. They are unrelated doctrines. Domain-specific or fine-tuned legal embedding models help, but most platforms use general-purpose models with light tuning.
- Long documents lose information. A 200-page deposition transcript cannot be embedded as a single vector and retain useful resolution. It must be chunked. We will get to chunking shortly. It is where most of the failures actually live.
The Hybrid Stack
The recall ceiling on either method alone is the reason every modern eDiscovery search runs both, then fuses the results. Add a reranker to that fusion, and you get the architecture diagrammed below.

Figure 2: The hybrid retrieval pipeline that runs underneath modern eDiscovery search. Two independent retrievers fan out, a fusion step combines their rankings, and a cross-encoder reranks the survivors before anything reaches the user.
The fusion step is almost always Reciprocal Rank Fusion, or RRF. The formula is genuinely a one-liner. For each document, sum 1/(k + rank) across all the ranking lists it appears in. The k is a smoothing constant typically set to 60.[5] If a document is ranked #1 by BM25 and #4 by vector search, its RRF score is 1/(60+1) + 1/(60+4) ≈ 0.0320. The document with the highest sum wins. That is the entire algorithm.
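Padded out to runnable Python, the one-liner looks like this (document IDs are hypothetical):

```python
def rrf(rankings, k=60):
    """Reciprocal Rank Fusion over any number of ranked lists of doc IDs."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_top = ["doc_a", "doc_b", "doc_c"]
vector_top = ["doc_c", "doc_a", "doc_d"]
print(rrf([bm25_top, vector_top]))
# ['doc_a', 'doc_c', 'doc_b', 'doc_d'] -- doc_a wins by ranking well in both lists
```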
What RRF gets right is that it is unitless. BM25 returns scores in the range of 0 to maybe 30. Cosine similarity returns scores between -1 and 1. Adding those raw scores produces nonsense. Normalizing them produces brittle nonsense. RRF throws away the magnitudes entirely and looks only at rank position. A document that ranks well in both lists wins. A document that ranks brilliantly in one and not at all in the other gets a moderate score. It is a remarkably stable algorithm. Elasticsearch, OpenSearch, Weaviate, Qdrant, and Azure AI Search all use it under the hood.[5]
Hybrid retrieval using RRF reaches roughly 91% recall@10 in published benchmarks. That is a meaningful jump from the 78% you get with dense retrieval alone.[4] On harder evaluations like the BRIGHT benchmark, hybrid gains can reach +24% over either method alone.[6]
The optional final stage is a cross-encoder reranker. Bi-encoder embeddings used for vector search compute query and document representations independently, then compare them with cosine similarity. A cross-encoder model takes the query and a candidate document together as a single input and produces a relevance score directly. It is much more expressive. The model can attend to interactions between specific query terms and specific document terms. It is also much slower. You cannot run a cross-encoder across four million documents. You can run one across the top 50 candidates that survived RRF, in roughly the time it takes to render the results page.
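A sketch of that final stage, with a public MS MARCO reranker standing in for whatever model a vendor actually runs, and two toy candidates standing in for the fifty RRF survivors:

```python
from sentence_transformers import CrossEncoder

# Public reranking model, illustrative only.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "non-compete obligations after separation"
candidates = [  # stand-ins for the ~50 chunks that survived RRF
    "Employee shall not engage in any competing business for twenty-four months",
    "lunch order for the all-hands meeting",
]

# The model reads query and candidate together, so it can attend to
# term-level interactions that a bi-encoder never sees.
scores = reranker.predict([(query, text) for text in candidates])
ranked = [text for _, text in
          sorted(zip(scores, candidates), key=lambda p: p[0], reverse=True)]
```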
Anthropic's Contextual Retrieval research was published in late 2024. It reported that adding a contextual chunking step plus a cross-encoder reranker reduced retrieval failures by 67% versus baseline embedding-only retrieval.[7] That number is the closest the industry has to a published benchmark for the full pipeline working well. It is the architectural target every serious vendor is shooting for.
The Chunking Decision That Quietly Determines Everything
A 50-page contract is not one document to a vector embedding model. It is somewhere between thirty and three hundred chunks, depending on how the platform decided to split it. That decision is invisible in the UI, rarely documented in vendor materials, almost never raised in ESI protocols. It is where most retrieval failures actually originate.
Chunking is the operation of breaking a source document into pieces small enough to embed. Embedding models have hard input limits, typically 512 to 8,192 tokens. A long document must be cut. The cuts can be:
- Fixed-token windows with optional overlap. Simple, fast, blind to document structure. A window that cuts mid-clause puts the non-compete language in one chunk and the consideration language in the next; searching for either may miss the connection. (A sketch of this strategy follows the list.)
- Recursive splitting at natural boundaries — paragraphs, headings, sentences. Better for structured documents, which legal documents almost always are. Most current best-practice guides recommend recursive splitting at 256 to 512 tokens with 10-20% overlap.[8]
- Semantic chunking that uses an embedding model to detect topic boundaries. Slower, sometimes better, often not worth the cost.
- Contextual chunking as described in the Anthropic research, where each chunk is prefixed with a short summary of the document it came from, generated by an LLM at index time. Expensive at ingestion, much higher recall at retrieval.[7]
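Here is the sketch promised above: the first strategy, fixed windows with overlap, assuming the document has already been tokenized into a list. The recursive splitter differs only in preferring paragraph and heading boundaries before falling back to windows like these.

```python
def fixed_window_chunks(tokens, size=512, overlap=64):
    """Fixed-token windows with overlap: simple, fast, structure-blind."""
    step = size - overlap
    return [tokens[i:i + size]
            for i in range(0, max(len(tokens) - overlap, 1), step)]
```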
A 2026 systematic analysis identified what the authors called a "context cliff" around 2,500 tokens. Beyond that chunk size, response quality drops sharply. Separately, 256-token chunks consistently outperformed 384-token chunks on precision.[8]
Translate this to legal review. A force majeure clause buried in a 90-page master services agreement may end up in a chunk that contains the clause but lacks the contract's parties, governing law, or term provisions. The embedding for that chunk will be about force majeure. It will not be about the parties or the dispute context. A query for force majeure invocation by AcmeCorp may retrieve the chunk weakly. The embedding does not know the chunk is from an AcmeCorp contract. The document is in the corpus. The chunk just doesn't carry enough context to come back when it should.
This is the silent failure mode that nobody puts in a defensibility brief. You cannot validate against it with a sampling protocol because the missing documents do not surface in the sample — they are missing from the candidate set the sample is drawn from.

Figure 3: How a chunking decision can cause a relevant document to score weakly. The force majeure chunk loses the party identity established in the recitals chunk, so a query naming the party retrieves the document worse than it should.
The mitigations are real engineering work. Contextual chunking. Hierarchical retrieval that searches at parent-document level after chunk-level recall. Metadata filtering that lets the platform restrict results to documents from a specific custodian or date range before ranking. Every serious platform implements some combination. Almost none expose the configuration to the buyer. You are usually choosing between vendor defaults that you cannot inspect.
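Of those mitigations, contextual chunking is the easiest to sketch. The shape below loosely follows the published Anthropic recipe;[7] llm and embed are hypothetical stand-ins for whatever model calls a platform makes at index time:

```python
CONTEXT_PROMPT = (
    "Document:\n{doc}\n\n"
    "Chunk from that document:\n{chunk}\n\n"
    "In one sentence, situate this chunk within the document "
    "(parties, document type, date) to improve search retrieval."
)

def contextual_embedding(doc_text, chunk_text, llm, embed):
    """Prefix each chunk with LLM-written context before embedding it."""
    context = llm(CONTEXT_PROMPT.format(doc=doc_text, chunk=chunk_text))
    # Paid once at ingestion; every later query benefits from the context.
    return embed(context + "\n" + chunk_text)
```

Indexed this way, the force majeure chunk from Figure 3 would carry the parties into its embedding, which is exactly the context a query naming AcmeCorp needs.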
What This Does to TAR Validation
The validation framework from Da Silva Moore assumes a closed mathematical loop. A classifier scores every document in the population. You sample at known confidence intervals. You measure recall and precision against ground-truth coding. The framework works because the classifier sees every document.
Hybrid retrieval-plus-reranker does not see every document the same way. It sees a candidate set selected by BM25 plus a candidate set selected by dense vector search, fuses them, then reranks the survivors. Documents that did not surface in either retriever do not get reranked. Chunks that did not carry sufficient context may not surface even when their parent document is relevant. The "population" the validation sample needs to represent is no longer the document corpus. It is the union of what the retrieval pipeline managed to find. Those are different sets.
This is one of the reasons the Sedona Conference Working Group 13 has been pushing on agentic-AI ESI protocol disclosures and decision-trail logging.[9] The defensibility question is no longer "did the classifier achieve 75% recall on a representative sample?" It is closer to a different set of questions. Did the retrieval pipeline surface every chunk that ought to have been retrieved? Did the reranker preserve the signal? Can we audit any of this? The answers depend on architectural choices the platform makes silently.
The eDiscovery Today decision-tree framework for AI-generated evidence was published in late April 2026. It includes authentication and reliability nodes that touch directly on this problem. A producing party must be able to show three things. The retrieval and analysis methodology was sound. It was repeatable. It was capable of surfacing the evidence the case turns on.[10] That standard is hard to meet when the producing party cannot see the chunking strategy, the embedding model, or the fusion configuration.
A Practical Framework for Evaluating Vendor "AI Search"
Here is the question set that distinguishes a serious platform from a marketing-led one. None of these questions are unfair. All of them have answers if the vendor knows what is running underneath the interface.
| Question | What it tests | What "I don't know" means |
|---|---|---|
| What embedding model do you use, and is it fine-tuned on legal text? | Domain fit and disclosure | The vendor licensed an opaque API and never tested domain performance |
| What is your chunking strategy for documents over 8K tokens? | Whether long documents work | Long-document retrieval is a coin flip |
| Do you fuse keyword and vector results, and using what algorithm? | Hybrid awareness | The platform is dense-only or keyword-only, masquerading as hybrid |
| Do you apply a cross-encoder reranker, and on how many candidates? | Final-stage precision | Precision at the top of the list is unverified |
| Can I export the candidate set before reranking, for my own analysis? | Auditability | The black box is sealed |
| How do you handle queries containing Bates numbers or other identifiers? | Sparse retrieval awareness | Hot exhibits are findable only if you know the right phrase |
| What is your published recall@10 on legal-domain evaluations? | Whether evaluation exists | The platform was never benchmarked against its peers |
| Can I add metadata filters before retrieval, or only after? | Pre-retrieval filtering | Privilege filtering happens too late |
Table 1: Vendor evaluation framework for hybrid retrieval architectures in eDiscovery review platforms. Source: synthesized from current information retrieval research and published vendor architectures.
The honest answer to several of these from most vendors in 2026 will be a short pause followed by "let me get back to you." That pause is the point. The buyer should be able to ask the question.
What This Means for the People Who Pay the Bills
Step back from the architecture for a moment. A small plaintiff's firm running a wrongful-termination case cannot afford to miss the critical email because the chunking strategy buried it. A solo immigration practitioner cannot afford to over-produce because the retrieval pipeline returned a thousand near-duplicate candidates. A reranker could have collapsed those down to the twelve that mattered.
The vendor pricing reset of 2026 lowered the sticker price on capability that used to cost thousands per gigabyte. Relativity bundled aiR for free. Everlaw matched. DISCO ships a single per-GB price. That solves part of the access-to-justice problem. It does not solve the asymmetry problem. The vendor knows what its retrieval pipeline does. The buyer does not. An architecture you cannot see is an architecture you cannot challenge in motion practice. A chunking strategy you cannot inspect is a chunking strategy you cannot validate.
That asymmetry is fixable. Most of what is in this post is published research, available to anyone with a search engine and the patience to read information retrieval papers. The vendors are not hiding it because it is proprietary. They are hiding it because most buyers have not asked.
The next time a vendor walks into your office and demos AI-powered search across your dataset, ask which retrievers are running. Ask what fuses them. Ask whether a cross-encoder reranks the top candidates. Ask what chunk size, with what overlap, on what embedding model. If the answers are coherent, you are talking to a serious engineering team. You can negotiate accordingly. If the answers are vague, you are about to spend money on a feature whose failure mode nobody at the vendor has bothered to map.
The technology underneath modern eDiscovery search is genuinely better than what came before. But the buyer's leverage depends on being able to read the architecture: to compare platforms, validate productions, and challenge opposing counsel's methodology. None of this is too technical for litigation professionals to understand. Most of it has a clean diagram. We have just been asked, for too long, to take it on faith.
The post-faith era of eDiscovery search starts with reading the diagram.
Related Reading
- The Great eDiscovery Price Reset: How Free AI, All-Inclusive Pricing, and Cloud Competition Are Finally Making Justice Affordable
- From TAR to GenAI: How HaystackID's Acquisition Reshapes Document Review
- The People vs. PACER: How Free Law Project Is Building the Open-Source Infrastructure to Make Court Records Free
