When AI Corrupts the Record: The DELEGATE-52 Benchmark

A Microsoft benchmark found AI corrupts about 25% of a document over twenty edits, in changes that pass review. The fallout for eDiscovery.

By Claude and Gemini with Sid Newby | June 2026

This post was drafted with help from Claude and Gemini. According to a new benchmark out of Microsoft Research, both of them would quietly corrupt about a quarter of a document I handed them to edit, given twenty rounds to work on it — and they would do it in ways polished enough to survive a skim. So before this went anywhere, I read every line of it myself. That instinct, it turns out, is the whole story.

A number worth staring at

The benchmark is called DELEGATE-52, and the paper attached to it carries a title that does not bury the lede: LLMs Corrupt Your Documents When You Delegate.^[1] Three researchers — Philippe Laban, Tobias Schnabel, and Jennifer Neville — built a test that does something most AI evaluations never bother to do. Instead of asking a model to answer a question once and grading the answer, it asks the model to live with a document across a long sequence of edits, the way a paralegal lives with a production set across a matter.^[2]

The setup is clever in its cruelty. For each of 52 professional domains — coding, accounting ledgers, crystallography, music notation, the kind of structured and semi-structured files that real work actually produces — the benchmark runs paired tasks. First a forward edit: change this, update that. Then an inverse task that requires the model to put the document back the way it was. If the model genuinely understood and preserved the content, the round trip should return something close to the original. The gap between what comes back and what went in is the corruption.^[3]

They ran nineteen models through it. The frontier systems — Gemini 3.1 Pro, Claude 4.6 Opus, GPT 5.4 — and the older or smaller ones underneath them. And the headline number is the one I want every litigation support manager reading this to sit with for longer than is comfortable: across twenty delegated interactions, frontier models corrupted an average of 25 percent of document content. Averaged across all nineteen models, the degradation hit 50 percent.^[4]

Half. Of the document. Gone or wrong after twenty passes, averaged across the field, at the hands of the kind of tool that the entire legal technology industry spent the last eighteen months selling as the thing that would finally let you stop checking its work.

[Figure 1 image: flowchart of the DELEGATE-52 round-trip benchmark, in which a document is edited, then restored, then compared against the original to measure corruption.]

Figure 1: The DELEGATE-52 round-trip design. A model that truly preserves a document's content can undo its own edits and return to the original. The measured gap between the original and the reconstruction is the corruption — and it compounds across interactions.
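The round-trip measurement can be sketched in a few lines. This is a minimal illustration, not the benchmark's own harness: every function name here is hypothetical, the "edits" are plain string functions standing in for model calls, and the similarity score uses Python's `difflib` as a stand-in for whatever scoring the paper actually applies.

```python
from difflib import SequenceMatcher
from typing import Callable, List

Edit = Callable[[str], str]  # stand-in for one delegated model call


def reconstruction_score(original: str, reconstructed: str) -> float:
    """Similarity in [0, 1]; 1.0 means the round trip returned the original exactly."""
    return SequenceMatcher(None, original, reconstructed, autojunk=False).ratio()


def round_trip(document: str, forward: Edit, inverse: Edit, rounds: int) -> List[float]:
    """Apply forward-then-inverse repeatedly, scoring against the untouched original."""
    scores = []
    current = document
    for _ in range(rounds):
        current = inverse(forward(current))
        scores.append(reconstruction_score(document, current))
    return scores


# Hypothetical document and edit pairs, invented for illustration only.
doc = "Acme Corp pays Apex Ltd 1,200 USD."
STAMP = "\nREVIEWED"


def stamp(text: str) -> str:
    return text + STAMP


def unstamp(text: str) -> str:
    return text[: -len(STAMP)] if text.endswith(STAMP) else text


def rename(text: str) -> str:
    return text.replace("Acme", "Apex")


def undo_rename(text: str) -> str:
    # Over-corrects: also rewrites the "Apex" that was in the original.
    return text.replace("Apex", "Acme")


lossless_scores = round_trip(doc, stamp, unstamp, rounds=3)
lossy_scores = round_trip(doc, rename, undo_rename, rounds=3)
```

In this toy case the stamp pair round-trips cleanly, while the rename pair hands back a fluent sentence that names the wrong payee, which is the kind of substitution a skim would miss.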

The degradation has a shape

Raw averages hide the part that matters for evidence. Dig into the per-model curve and a pattern emerges that should change how you think about AI in any document-handling pipeline.

Model	Reconstruction after 2 interactions	Reconstruction after 20 interactions	Total degradation
Gemini 3.1 Pro	96.8%	80.9%	19.1%
Claude 4.6 Opus	94.2%	73.1%	26.9%
GPT 5.4	94.3%	71.5%	28.5%
GPT-4o	45.6%	14.7%	85.3%

Table 1: Reconstruction scores degrade as delegated interactions accumulate. Even the strongest model loses roughly a fifth of its fidelity over twenty rounds; an older model like GPT-4o falls off a cliff. Source: DELEGATE-52, Microsoft Research.^[3]

Two things jump out. The first is that the strong models are not actually avoiding the damage — they are postponing it. The researchers were blunt about it: the better systems "aren't avoiding small errors better, they delay critical failures to later rounds."^[5] The corruption does not seep in gradually, a typo here, a dropped clause there. It arrives in catastrophic single-round drops — losses of ten to thirty points in one interaction — and those sudden failures account for roughly 80 percent of the total damage. The document looks fine, fine, fine, and then one round trip mangles it.

The second is subtler and worse. Weaker models tend to delete content, which at least leaves a hole you might notice. Frontier models do something more dangerous: they corrupt by substitution, producing plausible-looking changes that pass review.^[4] A deleted paragraph is a missing-tooth gap a reviewer catches. A confidently rewritten figure, a transposed account number, a date nudged by a year, a "shall not" smoothed into a "shall" — those survive the skim. The model's competence is precisely what makes its errors hard to catch.

[Figure 2 image: pie chart showing 80 percent of document damage coming from catastrophic single-round failures and 20 percent from gradual drift across rounds.]

Figure 2: Corruption is not a slow leak. Roughly four-fifths of the total document damage in DELEGATE-52 comes from sudden, severe failures in individual interactions — the kind that no amount of "the output looked clean last time" will predict.
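To make the sudden-versus-gradual split concrete, here is a rough sketch of how one might separate the two in a series of reconstruction scores. The function name and the score series are invented for illustration and are not data from the paper; the ten-point threshold simply borrows the low end of the "ten to thirty points" range quoted above.

```python
from typing import Dict, List


def damage_breakdown(scores: List[float], drop_threshold: float = 10.0) -> Dict[str, float]:
    """Split total score loss into sudden drops and gradual drift.

    scores: reconstruction score (0-100) after each interaction,
            starting with 100.0 for the untouched document.
    A single-interaction loss of at least drop_threshold points counts as sudden.
    """
    sudden = 0.0
    gradual = 0.0
    for previous, current in zip(scores, scores[1:]):
        loss = previous - current
        if loss <= 0:
            continue  # no damage (or a recovery) in this interaction
        if loss >= drop_threshold:
            sudden += loss
        else:
            gradual += loss
    total = sudden + gradual
    share = sudden / total if total else 0.0
    return {"total": total, "sudden": sudden, "gradual": gradual, "sudden_share": share}


# Hypothetical series: fine, fine, fine, then one bad round.
example_scores = [100.0, 99.0, 98.0, 98.0, 80.0, 79.0, 78.0]
breakdown = damage_breakdown(example_scores)
```

In this invented series a single eighteen-point drop accounts for most of the twenty-two points lost, which is the shape the researchers describe.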

Picture how that plays out on a real matter. A reviewer hands an AI a thousand-document tranche to summarize for a privilege log. The tool reads a partner's email and writes a description: "Email reflecting legal advice regarding settlement strategy." Clean, defensible, exactly the kind of entry that survives a meet-and-confer. Except the underlying email was about a vendor contract, and somewhere in a long agentic run the model conflated it with the document three positions up the thread. The log entry is grammatical, plausible, and wrong, and the only person who would catch it is a reviewer reading the source email line by line, which is the work the tool was bought to eliminate. The benchmark measures corrupted content, not bad log entries, so the translation is loose. But if errors landed at anything like that 25 percent rate, a thousand-entry log would carry some 250 quietly fictional entries, produced under a certification, waiting for opposing counsel to find the one that matters.

And then the finding that should end a few sales presentations: agentic operation made things worse, not better. Letting the model use tools (the autonomous, multi-step workflows that every vendor demo has been pushing since Legalweek) added an average of 6 percent more degradation.^[5] The "ready" bar in the study was a 98 percent reconstruction score. Across all nineteen models and all 52 domains, more than 80 percent of model-domain combinations failed to clear it. The single best performer, Gemini 3.1 Pro, was "ready" in only eleven of the 52 domains.^[4] The only domain where almost every model held up was Python code: structured, verifiable, and about as far from a messy custodial email thread as a document gets.

Now put it in a discovery workflow

Here is why a Microsoft Research paper about editing crystallography files belongs on a litigation technology blog. Strip away the domains and DELEGATE-52 is measuring one thing: what happens to a document's integrity when you hand it to an AI and let the AI keep working on it. That is a description of half the eDiscovery roadmap for 2026.

The selling point of agentic review, AI redaction, automated privilege logging, and GenAI summarization is delegation: give the system the document set and let it run. DELEGATE-52 says the more you delegate, the more the documents drift, and the drift is invisible because it reads like competent work. Map the failure mode onto the EDRM stages and the exposure gets concrete fast.

[Figure 3 image: flowchart mapping AI corruption-risk entry points across EDRM stages, from collection and processing through AI-assisted review, privilege and redaction, and production.]

Figure 3: Every stage where an AI rewrites, re-renders, or re-tags a document is a stage where DELEGATE-52-style corruption can enter the record — and the later it enters, the closer it is to something a court will treat as evidence.

Workflow stage	What the AI is asked to do	What DELEGATE-52 predicts
GenAI review	Summarize, translate, classify, re-tag	Plausible misstatements that drive responsiveness and privilege calls
Privilege logging	Generate descriptions from document content	Confidently wrong descriptions that look defensible
AI redaction	Identify and apply redactions across a set	Missed or mislocated redactions in long automated runs
Production prep	Convert formats, apply Bates, re-render	Content altered during transformation, undetected
Agentic pipelines	Chain all of the above autonomously	Highest degradation — tool use adds ~6%

Table 2: The discovery tasks most aggressively marketed as "let the AI handle it" are the same tasks DELEGATE-52 flags as most exposed to silent corruption.

Redaction is the place this stops being theoretical, because legal technology already has a body count. AI redaction tools are genuinely good at the first pass — they scan a set in minutes and flag Social Security numbers, dates of birth, account numbers, the obvious PII.^[6] But the failures are spectacular when they come. A defect disclosed earlier this year showed that a major review platform's redaction tool, applied to Excel files with hyperlinks, left the "redacted" content fully readable to anyone who opened the native file in a plain text viewer — it only looked redacted in Excel.^[7] We have watched cosmetic redactions get reverse-engineered out of high-profile production sets, and watched a federal release of millions of pages go out with thousands of redaction failures. The lesson DELEGATE-52 adds is that the next generation of these failures will not look like failures. They will look like clean documents that happen to be wrong.

Fabrication was the loud crisis. This is the quiet one.

We have spent a year talking about AI hallucination in courtrooms — the fabricated case citations, the lawyers sanctioned for filing briefs full of decisions that never existed, the running tally that crossed twelve hundred cases. That problem is real, and it has a saving grace: a hallucinated citation is checkable. You pull the reporter, the case is not there, the lie collapses. Embarrassing, sanctionable, but catchable by anyone willing to look.

DELEGATE-52 describes the opposite failure, and it is the one that should worry a discovery team more. Hallucination invents something that is not there. Document corruption alters something that is. There is no external reporter to check a produced email against — the produced version is the record, unless you kept a pristine original and thought to compare. A fabricated citation announces itself the moment someone looks it up. A corrupted document hides inside a production set of two hundred thousand others, indistinguishable from the clean ones, until the day it surfaces in a deposition and nobody can explain why the version on the screen does not match the version in the witness's memory.

The fabrication crisis was loud because it failed in public, on the docket, in front of judges who knew how to spot a fake case. The corruption problem will be quiet precisely because it succeeds at looking real. The same fluency that makes these models useful is the thing that makes their mistakes hard to find — and in an evidentiary context, "hard to find" is a euphemism for "discovered too late."

The defensibility problem

Litigators have spent fifteen years building a vocabulary for trusting machines with documents. Recall and precision. F1 scores. Validation protocols. The TAR case law that ran from Da Silva Moore forward gave us a framework: you do not have to prove your process is perfect, you have to prove it is reasonable and defensible.^[8] That framework assumed a specific kind of machine — one that classified documents, sorted them into piles, but did not rewrite their contents. The document that went into TAR came out of TAR unchanged.

Generative AI broke that assumption, and almost nobody updated the framework. A model that summarizes, translates, redacts, or re-renders a document is operating on the content itself. DELEGATE-52 is the first rigorous measurement of how badly that operation can go, and it lands directly on questions a court actually asks.

Authenticity under Federal Rule of Evidence 901 assumes the thing you produced is what it claims to be. The 2017 amendments that gave us Rule 902(14), which allows self-authentication of electronic data through a hash value matching the original, were built on the premise that a file's digital fingerprint is a reliable proxy for its integrity. That premise holds beautifully right up until an AI rewrites the content. Hash the altered file after the fact and the value is perfectly valid and perfectly useless, because it authenticates a document that no longer says what the source said. Chain of custody assumes you can account for what happened to a file between collection and production. When an AI silently alters content somewhere in that chain, and the alteration is, by the benchmark's own finding, plausible enough to pass review, you have a document whose provenance you can no longer fully vouch for, and a hash that will swear the corrupted version is the real one. The conversion problem that forensic examiners have warned about for years, where flattening a native file to PDF strips metadata and breaks the chain, now has a generative cousin: the substance of the document itself drifts, while the metadata sits there looking untouched.^[9]
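The hash point is easy to see in miniature. The following is only a sketch with an invented one-sentence "document": it shows that a hash computed after an alteration verifies the altered bytes perfectly well, and that only a comparison against a hash taken at collection exposes the change. The variable names are illustrative and not drawn from any review platform.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of the given bytes as a hex string."""
    return hashlib.sha256(data).hexdigest()


# Hypothetical content, invented for illustration.
collected = b"Licensee shall not assign this Agreement."
after_ai_pass = b"Licensee shall assign this Agreement."

hash_at_collection = sha256_hex(collected)
hash_at_production = sha256_hex(after_ai_pass)

# A hash taken after the alteration matches the altered file every time,
# so it "authenticates" the wrong text.
self_consistent = sha256_hex(after_ai_pass) == hash_at_production

# Only the comparison against the collection-time hash reveals the drift.
matches_original = hash_at_collection == hash_at_production
```

The hash does its job in both cases. What matters is which version of the file it was computed from, and whether anyone compares it against the value recorded at collection.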

This is why the experienced people in the room keep saying the same unglamorous thing. The human-in-the-loop step — an attorney or senior reviewer confirming AI flags before redactions are finalized, the audit trail logging what was changed, by whom, on what basis, at what time — is not optional for defensibility, and it is exactly the step the agentic sales pitch wants to automate away.^[6] EDRM and other industry bodies are expected to produce GenAI benchmarking standards in the next year, the way they eventually did for TAR.^[10] DELEGATE-52 is a preview of what those standards will have to measure, and the early numbers are not flattering.

The model's competence is the problem. A tool that failed obviously would be safe. A tool that fails plausibly, sparsely, and silently is the one that ends up in a sanctions motion.

What a careful shop actually does

None of this is an argument against AI in discovery. The recall numbers on GenAI review are real, the cost compression is real, and a small firm that finally got priced into a corpus it could never have reviewed by hand is not going to give that back because a benchmark spooked me. The argument is for treating these tools as what the benchmark proves they are: powerful, fast, and unreliable in a specific, measurable way that you can engineer around.

A few things follow directly from the data.

Minimize the round trips. The corruption compounds with the number of delegated interactions. A workflow that hands a document to a model once and validates the output is in far better shape than one that lets an agent iterate on the same file twenty times. When you architect a pipeline, count the interactions the way you would count GB through processing.

Keep the original immutable and hash it. The single best defense against silent content drift is a pristine, hashed copy of the source that never touches the model. If you can compare the produced version against an untouched original — programmatically, not by eye — you turn a silent failure into a loud one. This is old forensic discipline, and DELEGATE-52 just made it non-negotiable for any AI-touched production.

Scope the AI to retrieval, not mutation, wherever you can. A model that finds the responsive documents and points at the privileged passages is doing classification — the thing the TAR framework already knows how to defend. A model that rewrites, redacts, or re-renders the content is doing the thing the benchmark says it does badly. Keep the AI on the search side of the line as long as the matter allows.

Demand the audit trail, and read it. Every redaction, every reclassification, every transformation should produce a log. The agentic systems that degrade the most are also the ones whose vendors are least eager to surface what happened inside the loop. If a tool cannot tell you what it changed and why, it is not ready for an evidentiary workflow, no matter how good the demo looked.

Validate on a sample, every matter, every time. Not once during procurement. A human-reviewed sample of AI-touched documents, checked against originals, is the only thing standing between you and a 25 percent corruption rate you cannot see.
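Sample validation against originals can be sketched as follows. Treat it as an illustration under stated assumptions, not a validation protocol: the function names and document IDs are hypothetical, the documents are plain strings, and the check only suits AI steps that are supposed to preserve text (conversion, re-rendering). Summaries and privilege descriptions still need a human comparing output to source.

```python
import difflib
import hashlib
import random
from typing import Dict, List, Tuple


def sha256_hex(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def word_changes(original: str, produced: str) -> List[Tuple[str, str]]:
    """List (original words, produced words) for every span that differs."""
    a, b = original.split(), produced.split()
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            changes.append((" ".join(a[i1:i2]), " ".join(b[j1:j2])))
    return changes


def validate_sample(
    originals: Dict[str, str],
    produced: Dict[str, str],
    sample_size: int,
    seed: int,
) -> Dict[str, List[Tuple[str, str]]]:
    """Pick a reproducible random sample and report documents that drifted."""
    rng = random.Random(seed)  # fixed seed so the sample can be re-drawn later
    ids = sorted(originals)
    chosen = rng.sample(ids, min(sample_size, len(ids)))
    findings = {}
    for doc_id in chosen:
        source = originals[doc_id]
        output = produced.get(doc_id, "")
        # In practice the source hash would come from a collection-time manifest.
        if sha256_hex(source) != sha256_hex(output):
            findings[doc_id] = word_changes(source, output)
    return findings


# Hypothetical two-document set, invented for illustration.
originals = {
    "DOC-001": "Payment due 2021-03-01.",
    "DOC-002": "Licensee shall not assign.",
}
produced = {
    "DOC-001": "Payment due 2021-03-01.",
    "DOC-002": "Licensee shall assign.",
}
findings = validate_sample(originals, produced, sample_size=2, seed=7)
```

The hash comparison flags that a document changed, and the word-level diff points the reviewer at what changed, so the human pass is spent on the altered spans rather than on rereading everything.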

[Figure 4 image: mindmap of countermeasures against AI document drift, grouped into architecture, integrity, defensibility, and procurement.]

Figure 4: The countermeasures DELEGATE-52 implies are mostly old forensic discipline applied to a new failure mode — keep an untouched original, limit delegation, and verify against the source rather than against your last clean run.
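Two of those countermeasures, counting the round trips and keeping an audit trail, fit naturally into one small structure. The sketch below is a hypothetical design rather than any vendor's schema: the class names, field names, and the `max_interactions` budget are all assumptions made for illustration.

```python
import hashlib
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Callable, List


def sha256_hex(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


@dataclass(frozen=True)
class AuditEntry:
    """What was changed, by whom, on what basis, at what time."""
    doc_id: str
    action: str
    actor: str
    basis: str
    sha256_before: str
    sha256_after: str
    timestamp: str


@dataclass
class TrackedDocument:
    doc_id: str
    text: str
    max_interactions: int = 1  # hypothetical per-document budget of model passes
    log: List[AuditEntry] = field(default_factory=list)

    def apply(self, action: str, actor: str, basis: str,
              transform: Callable[[str], str]) -> None:
        """Run one transformation, refusing once the budget is spent."""
        if len(self.log) >= self.max_interactions:
            raise RuntimeError(
                f"{self.doc_id}: interaction budget of {self.max_interactions} exhausted"
            )
        before = self.text
        after = transform(before)
        self.log.append(AuditEntry(
            doc_id=self.doc_id,
            action=action,
            actor=actor,
            basis=basis,
            sha256_before=sha256_hex(before),
            sha256_after=sha256_hex(after),
            timestamp=datetime.now(timezone.utc).isoformat(),
        ))
        self.text = after


# Hypothetical usage: one redaction pass on an invented document.
tracked = TrackedDocument("DOC-001", "SSN 123-45-6789 on file.", max_interactions=1)
tracked.apply(
    action="redact",
    actor="model:hypothetical-redactor",
    basis="PII: Social Security number",
    transform=lambda t: t.replace("123-45-6789", "[REDACTED]"),
)
```

A second call to `apply` on the same document raises an error, which forces a deliberate decision, and ideally a human review, before the file goes through another model pass.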

The direction of travel is the worry

The uncomfortable part is that everything pushing the market is pushing toward more delegation, not less. The economics make sure of it. As law firm AI adoption deepens from a few thousand power users into genuine agentic, multi-step workflows, token costs stop being an invisible line bundled into a platform fee and start becoming a visible, metered expense — and the response, predictably, will be to let the agent do more on its own to justify the spend.^[11] The vendor incentive runs the same direction: autonomy demos better than supervision, and "you still have to check its work" has never sold a seat.

That is the gap DELEGATE-52 exposes, and it is the same gap I keep running into across every one of these stories. The benchmark says competence and reliability are different things, and the market is pricing them as if they were identical. The firm with the budget for a second human pass over every AI-touched production will catch the silent 25 percent. The lean plaintiff's shop that finally got AI review because it was the only way to afford the matter at all is the one most likely to produce a corrupted record and never know, because frontier-model competence makes the corruption look exactly like clean work.

So check the work. Hash the originals. Count the round trips. Treat the confident, fluent, helpful machine as the most dangerous kind of unreliable narrator, because the benchmark just put a number on exactly how unreliable it is. I read every line of this post before it went out. Your production set has a stronger claim on that kind of attention than a blog does.