The 16 Percent Problem: How Collection Outgrew eDiscovery

Collection spending grows at a 16% CAGR through 2030, four times review's growth rate. The fastest-growing slice of eDiscovery has the fewest vendors selling it.

By Claude and Gemini with Sid Newby | May 2026

Walk through the vendor floor at any 2026 legal tech conference and count the booths advertising AI document review. Then count the booths advertising AI collection. The first list is twenty deep. The second one is empty. That gap, currently invisible to most buyers, is the central story of the eDiscovery market for the rest of the decade.

ComplexDiscovery published a quiet market intelligence report on May 3 that maps the shift in numbers no vendor's pitch deck wants to highlight.^[1] Between 2025 and 2030, eDiscovery collection spending grows from $3.33 billion to $7.02 billion, a compound annual growth rate of roughly 16 percent. Review spending, the part the vendors love to talk about, grows from $12.16 billion to $14.60 billion over the same window, a CAGR closer to 4 percent. The total eDiscovery market expands from $19.61 billion to $28.08 billion, but the composition of that market quietly inverts. Review's share collapses from 62 percent to 52 percent. Collection's share rises from 17 percent to 25 percent. The fastest-growing line item in eDiscovery is the one nobody is selling.^[1]

The structural reason is cleanly stated. AI compressed the price of review faster than data growth could offset it. Collection had no comparable AI revolution. The work itself (figuring out what data lives where, getting it out of the platform that owns it, validating that the export contains what it should) got harder, not cheaper, because the platforms got more numerous and more complex. The result is the pricing inversion that ComplexDiscovery's report makes visible. The result is also a vendor stack that has spent five years optimizing the shrinking part of the market while the growing part of the market goes unsold and unsolved.

The Numbers Nobody Highlighted

The headline numbers from the May 3 report deserve to be read carefully because the implications travel further than the analyst commentary suggests.

Task	2025 spend	2030 spend	CAGR	2025 share	2030 share
Review	$12.16B	$14.60B	~4%	62%	52%
Processing	$4.12B	$6.46B	~9%	21%	23%
Collection	$3.33B	$7.02B	~16%	17%	25%
Total eDiscovery	$19.61B	$28.08B	~7%	100%	100%

Table 1: eDiscovery task composition shift, 2025 to 2030. Source: ComplexDiscovery, "Market Intelligence: The eDiscovery Task Composition Shift from 2025 to 2030" (May 3, 2026).^[1]

Three structural facts fall out of that table.

First, the absolute size of review barely grows. Roughly $12.2 billion to $14.6 billion over five years is a compound rate that the rest of the legal services economy would call stagnation. AI made review cheaper per document at about the same rate that data volumes made review more voluminous, and the two effects roughly cancel. The market for review is essentially treading water in dollar terms, even as the underlying work continues to expand on a unit basis.

Second, collection more than doubles. $3.3B to $7.0B is the kind of curve that attracts venture capital and triggers acquisition activity. It is a curve that Relativity, Everlaw, and DISCO are poorly positioned for, having spent the last 15 years building infrastructure in the wrong place. The investment dollars went into review platforms, because that is where the recurring revenue lives and that is what the AmLaw 200 buys. The growth dollars are migrating to collection, where almost none of the cloud-native review platforms have built native capability.

Third, processing grows faster than review but slower than collection. Processing is the middle layer — getting collected data into a state where review can begin. The 9 percent CAGR reflects mostly volume pressure offsetting modest pricing efficiencies. Processing is the canary in the coal mine for collection: when collection volumes grow, processing volumes follow, with a lag.
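The growth rates and shares in Table 1 can be re-derived from the dollar figures alone. The short Python sketch below does that arithmetic. The function and variable names are ours, chosen for illustration, and the only inputs are the spend figures quoted from the ComplexDiscovery report.

```python
# Spend figures in $ billions, copied from Table 1 as (2025, 2030).
SPEND = {
    "Review": (12.16, 14.60),
    "Processing": (4.12, 6.46),
    "Collection": (3.33, 7.02),
}
YEARS = 5  # 2025 through 2030


def cagr(start: float, end: float, years: int) -> float:
    """Compound annual growth rate as a fraction (0.16 means 16%)."""
    return (end / start) ** (1 / years) - 1


# Totals are the sum of the three task rows.
total_2025 = sum(start for start, _ in SPEND.values())
total_2030 = sum(end for _, end in SPEND.values())
total_cagr = cagr(total_2025, total_2030, YEARS)

rows = {
    task: {
        "cagr": cagr(start, end, YEARS),
        "share_2025": start / total_2025,
        "share_2030": end / total_2030,
    }
    for task, (start, end) in SPEND.items()
}

for task, r in rows.items():
    print(
        f"{task:<11} CAGR {r['cagr']:.1%}  "
        f"share {r['share_2025']:.0%} -> {r['share_2030']:.0%}"
    )
print(f"{'Total':<11} CAGR {total_cagr:.1%}")

# How many times faster collection grows than review.
growth_ratio = rows["Collection"]["cagr"] / rows["Review"]["cagr"]
print(f"Collection CAGR / review CAGR: {growth_ratio:.1f}x")
```

Run as written, the growth rates round to the table's 4, 9, 16, and 7 percent, and the ratio of collection's growth rate to review's comes out a little above four.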

Pie chart of 2025 eDiscovery spend by task, with review at 62 percent, processing at 21 percent, and collection at 17 percent.

Pie chart of projected 2030 eDiscovery spend by task, with review declining to 52 percent, processing growing to 23 percent, and collection rising to 25 percent.

Figure 1: The composition inversion. Review's share of total eDiscovery spend declines from 62 percent to 52 percent. Collection's share grows from 17 percent to 25 percent. The fastest-growing slice of the pie is the one with the fewest vendors competing for it, and the smallest amount of analyst attention.

The report's methodology (reconciled estimates aligned to common scope, drawing from publicly available research and vendor disclosures) is conservative. ComplexDiscovery has been doing this market sizing exercise for over a decade and Rob Robinson's reconciliation work is the closest thing the industry has to a neutral baseline.^[2] Other 2026 forecasts, including Fortune Business Insights and the SaaS eDiscovery Enablement Market reports, project roughly similar shapes with somewhat higher absolute numbers but the same underlying dynamic: collection growing faster than review.^[3] The shape is robust across forecasters even where the level is contested.

Why Review Got Cheap

The review-side story is the one that has been told already, in this blog and elsewhere. The short version is that 2024 and 2025 produced a one-way ratchet on per-document review pricing.

Traditional contract review through staffing agencies cost $1 to $3 per document for first-pass responsiveness review and $4 to $8 per document for privilege review.^[4] On a 250,000-document matter, a small commercial dispute by 2026 standards, that math comes out to between $250,000 and $750,000 in raw first-pass review labor before anyone touches the platform fees. AI-assisted review, by the Winter 2026 eDiscovery Pricing Survey's measurement, has converged in the $0.11 to $0.50 per-document range, roughly one-tenth to one-sixth of the historical first-pass cost.^[4]
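For readers who want to rerun the per-document math on their own matter size, here is a minimal Python sketch. It uses only the price ranges quoted above. The names are illustrative, and it compares AI-assisted pricing against first-pass review only, leaving privilege review and platform fees out.

```python
DOCS = 250_000  # matter size used in the paragraph above

# Per-document price ranges in dollars as (low, high), quoted from the text.
TRADITIONAL_FIRST_PASS = (1.00, 3.00)
AI_ASSISTED = (0.11, 0.50)


def matter_cost(docs: int, per_doc: tuple[float, float]) -> tuple[float, float]:
    """Return the (low, high) total cost of reviewing `docs` documents."""
    low, high = per_doc
    return docs * low, docs * high


traditional = matter_cost(DOCS, TRADITIONAL_FIRST_PASS)
ai = matter_cost(DOCS, AI_ASSISTED)

# Compare like with like: low end to low end, high end to high end.
ratio_low = ai[0] / traditional[0]
ratio_high = ai[1] / traditional[1]

print(f"Traditional first pass: ${traditional[0]:,.0f} to ${traditional[1]:,.0f}")
print(f"AI-assisted:            ${ai[0]:,.0f} to ${ai[1]:,.0f}")
print(f"AI cost as a share of traditional: {ratio_low:.0%} to {ratio_high:.0%}")
```

Changing `DOCS` is enough to re-scope the comparison. The ratio between the two pricing models does not depend on matter size, only on the per-document ranges.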

Relativity made the picture more dramatic in October 2025 when it announced that its aiR for Review and aiR for Privilege GenAI products would be included in the standard RelativityOne package starting early 2026, at no incremental cost.^[5] Everlaw matched the move with EverlawAI. DISCO collapsed its agentic review tooling into its single per-GB price. The vendors zeroed review pricing out as a competitive differentiator entirely. Compression turned into deletion. Review is no longer something you buy. It is something the platform does.

That is what AI compression looks like when it actually arrives in a market. The unit cost falls so far that the line item collapses into platform overhead. The 4 percent CAGR on review spend in the ComplexDiscovery forecast is the residue of a much sharper compression on per-document pricing, offset by continued growth in document volumes. If document volumes were not still growing, review's share of the market would not be declining gracefully. It would be cratering.

But the absolute review market is essentially flat in dollar terms. Whatever growth remains comes from new matter starts, ESI volume increases, and the long tail of platform-and-services bundling that did not get AI-zeroed. That stagnation, on its own, would not be remarkable. What makes it the story of 2026 is that the spend going to review is not disappearing from the eDiscovery economy. It is migrating upstream to collection, where the work has gotten dramatically harder.

Why Collection Got Expensive

There is no AI productivity miracle for collection. Collection is the work of identifying custodians, scoping data sources, accessing the systems where the data lives, executing a defensible export, validating the export, and documenting chain of custody. That work has been getting harder for five years for reasons that have nothing to do with whether anyone wants to apply machine learning to it.

The reasons are structural, and they fan out in three directions.

Cloud platform fragmentation. The 2018 enterprise data environment had two or three places where evidence lived: Exchange, file shares, maybe SharePoint. The 2026 enterprise data environment has dozens. Microsoft 365 alone now spans Exchange, OneDrive, SharePoint, Teams, Viva Engage (formerly Yammer), Loop, Forms, Whiteboard, Stream, Planner, and the Substrate. Google Workspace adds Drive, Gmail, Chat, Meet, and Sites. Slack, Zoom, Notion, Salesforce, Asana, Jira, Confluence, Box, Dropbox, and the entire SaaS sprawl each carry data that may be discoverable. Each platform has its own export API, its own retention model, its own permissions inheritance, its own version history, and its own quirks that the collection vendor has to know cold.^[6] An average mid-sized enterprise in 2026 has 50 to 80 SaaS applications under contract. The collection problem scales with the number of platforms, not the number of custodians.

Hyperlinked files and modern attachments. This is the one we covered in detail in the last post. The short version: cloud attachments (the default sharing mechanism in modern Outlook, Gmail, Teams, and Slack) break the email family by storing the linked file separately from the email that references it. Collecting them defensibly requires one of three things: Metaspike's Forensic Email Collector at roughly $200K-$500K per matter in labor costs, pre-litigation Microsoft Purview retention label configuration, or a written disclosure that current versions are being produced where contemporaneous versions are not technologically feasible.^[7] None of these options existed as line items in collection budgets five years ago. All of them now do.

AI-generated ESI. The Heppner ruling in early 2026 held that a defendant's Claude chat logs were not privileged.^[8] That ruling, combined with the broader proliferation of AI meeting tools, copilots, and prompt-and-response logs, has expanded the universe of discoverable data into a category most enterprises do not know how to collect. Microsoft Copilot generates SubstrateHolds, .loop files, and fragmented SharePoint containers that traditional collection tools were never designed to capture. AI meeting transcripts from Zoom, Teams, Otter, Fathom, and Read live in each application's own storage system and require platform-specific connectors. None of this is even on the per-GB pricing schedules most vendors quote.

Flowchart comparing the 2018 enterprise data environment's three sources to the 2026 environment's combinatorial sprawl across Microsoft 365 surfaces, Google Workspace, SaaS applications, and AI-generated ESI.

Figure 2: Why collection costs are growing 4x faster than review. The 2018 data environment fit on one ESI protocol page. The 2026 environment fans out across cloud platforms, SaaS sprawl, and AI-generated ESI categories that did not exist five years ago. Every new platform multiplies the collection scope; every new platform adds API quirks, retention rules, and authentication overhead that the collection vendor has to know cold.

The math compounds because the work compounds. A 2025 commercial litigation matter routinely involves data from email, three to five SaaS platforms, a chat tool, a document management system, and increasingly an AI assistant or two. Each of those sources has its own collection workflow, its own custodial scoping, its own validation requirements. Forensic specialists who can do this work end-to-end charge $300 to $600 an hour. The cost of getting data into a defensible state has roughly tripled since 2018, even before considering hyperlinked files or AI-generated ESI.

The vendor response to all of this has been remarkably tepid. RelativityOne shipped what it calls "SaaS-native enablement with zero-ETL data ingestion" in late 2025, advertising that law firms can process 1TB+ datasets directly from Microsoft 365 and Slack at processing costs 50 percent below legacy pipelines.^[9] That is real progress on the processing side, the bridge between collection and review, but it does not solve the upstream collection problem. Pulling the right 1TB out of a Microsoft 365 tenant in the first place still requires forensic specialists, Purview policy work, custom API tooling, and the disclosure-and-defensibility apparatus that nobody has automated.

What Collection Should Cost

The pricing structure for collection in 2026 is dominated by labor, not software. The standard quotes from the Winter 2026 eDiscovery Pricing Survey put hosting in the $5 to $15 per GB per month range and processing in the $3 to $10 per GB range, both compressed by aggressive vendor competition.^[4] Collection is rarely sold per GB at all. It is sold by the matter, by the custodian, or by the hour — because the work itself is heterogeneous, custodian-specific, and platform-specific in ways that GB-based pricing cannot capture cleanly.

A representative 2026 collection budget for a mid-sized commercial matter looks something like this:

Cost component	Range	Notes
Custodian interviews and scoping	$5K-$25K	Typically 5-25 custodians at about $1K each
Microsoft 365 collection (Purview-driven)	$15K-$75K	Heavily dependent on whether retention labels are in place pre-hold
Google Workspace collection (Vault)	$10K-$40K	Vault transitions and contemporaneous-version pulls drive cost variance
Slack/Teams collection	$10K-$50K	Per-channel scoping; modern attachment re-pull adds 30-50%
SaaS app collection (Salesforce, Box, Notion, etc.)	$5K-$30K per app	Custom connectors required for most; varies wildly
Hyperlinked file / FEC re-collection	$20K-$200K	Optional; depends on ESI protocol
Forensic specialist hours (validation, chain of custody, deposition prep)	$30K-$150K	$300-$600/hour, scope-dependent
Total mid-sized matter collection	$95K-$570K	Sum of the rows above, counting one SaaS app

Table 2: Representative collection budget for a mid-sized commercial matter, 100,000 to 500,000 document range, mixed cloud/SaaS data sources. Practitioner estimates assembled from Winter 2026 eDiscovery Pricing Survey ranges and Microsoft 365 / Google Workspace published billing schedules.^[4]^[10]^[11]

The variance ($95K to $570K for a single matter) tells you most of what you need to know. Collection is not a commoditized service. The price depends entirely on which platforms are involved, what retention infrastructure was configured before litigation began, and how aggressively the producing party chooses to pursue contemporaneous-version preservation. The same volume of data can cost six times as much to collect from one organization as from another, depending entirely on which buttons were pressed in the M365 admin console two years before the lawsuit was filed.
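The total and the six-to-one spread can be checked by summing the Table 2 rows directly. The Python sketch below is a rough check under one stated assumption: exactly one SaaS app is counted, because that row is priced per app. Names are illustrative.

```python
# (low, high) ranges in $ thousands, copied from Table 2.
# Assumption: exactly one SaaS app is counted, since that row is priced per app.
BUDGET_K = {
    "Custodian interviews and scoping": (5, 25),
    "Microsoft 365 collection": (15, 75),
    "Google Workspace collection": (10, 40),
    "Slack/Teams collection": (10, 50),
    "SaaS app collection (one app)": (5, 30),
    "Hyperlinked file / FEC re-collection": (20, 200),
    "Forensic specialist hours": (30, 150),
}

low_total = sum(low for low, _ in BUDGET_K.values())
high_total = sum(high for _, high in BUDGET_K.values())
spread_ratio = high_total / low_total

# Which single component contributes the widest dollar range.
spread_by_row = {name: high - low for name, (low, high) in BUDGET_K.items()}
widest = max(spread_by_row, key=spread_by_row.get)

print(f"Total: ${low_total}K to ${high_total}K ({spread_ratio:.1f}x spread)")
print(f"Widest single range: {widest} (${spread_by_row[widest]}K)")
```

Adding a second or third SaaS app to the dictionary moves both ends of the range, which is one reason the table's total should be read as a floor for multi-app matters.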

That variance is what is driving the 16 percent CAGR. Most enterprises are at the top end of that range because they have not done the upstream work. Most plaintiffs cannot afford to demand the top-end methodology and end up accepting the bottom-end production with a written disclosure. The producing party defaults to the lowest-cost defensible option that survives proportionality review, and the gap between that option and the technologically possible option grows wider every year.

Where the Money Is Actually Going

The growth in collection spending in 2026 is not flowing to scrappy startups solving the problem from scratch. It is flowing through three categories of incumbent.

Forensic services firms like Consilio, Lighthouse, HaystackID, Epiq, and the litigation support arms of the Big 4 are absorbing most of the collection labor spend.^[12] Their economics are people-driven, not software-driven, and the rising labor cost of collecting from increasingly fragmented data environments translates directly to revenue growth. None of them have publicly disclosed margins on collection-specific work, but the absorption pattern — review revenue plateauing while collection-and-managed-services revenue grows — is consistent with the ComplexDiscovery composition shift.

Cloud platform vendors themselves capture a meaningful share through native eDiscovery features. Microsoft Purview eDiscovery (Premium) charges per-GB storage and pay-as-you-go billing for non-M365 data, with a $15/GB fee for exceeding seeded capacity.^[11] Google Workspace Vault is bundled with Business and Enterprise plans, but Workspace customers report spending 20 to 35 percent of annual Workspace cost on third-party eDiscovery integrations like Decipher and Venio Systems.^[10] The platform vendors are not in the collection-services business directly, but they are quietly monetizing the collection problem through tier upselling and storage fees.

Specialty collection tooling vendors, including Metaspike, Cellebrite, and Magnet Forensics on the device side plus SaaS connector vendors like CloudNine and CloudExtract, hold a smaller but higher-margin slice. Metaspike's Forensic Email Collector at $699 per machine per year is the canonical example: trivial software cost, dominant labor cost, captured by the firms running the tool.^[13]

What is conspicuously absent from this list is a clean, scaled, AI-native collection vendor. The companies that built the AI document review revolution — DISCO, Everlaw, Reveal, the in-house aiR teams at Relativity — are not the companies leading on collection. The collection problem is a systems-and-services problem, not a model-and-tooling problem. It requires deep integration with two dozen platform APIs, navigating retention and permissions models that change quarterly, and managing the validation overhead that machine learning cannot meaningfully reduce. That is not the kind of problem that yields cleanly to the AI playbook, which is why none of the headline AI vendors are particularly interested in it.

What This Means for the Rest of the Decade

The composition shift that ComplexDiscovery mapped is the kind of structural change that quietly determines who wins and loses in a market. Three predictions, ranked by confidence:

The vendor consolidation pattern is going to continue, but the targets shift. The 2024-2025 wave of acquisitions concentrated on review-platform and AI-tooling targets — HaystackID buying eDiscovery AI, Relativity bulking up its aiR engineering team, DISCO acquiring its way into the agentic AI category. The 2026-2028 wave will increasingly target collection capability. The forensic services firms, the SaaS connector vendors, and the platform-specific collection specialists are the next acquisition class. Watch for the larger MSPs to start buying their way into the SaaS-app coverage they currently outsource.

The pricing model changes. Per-GB pricing made sense when 90 percent of eDiscovery data fit in a few well-understood platforms. It makes less sense when the actual variance in collection cost depends on which specific platforms are involved and how each was configured. Expect more vendors to move toward per-platform, per-custodian, or per-hour pricing for collection — even as review pricing converges on bundled platform fees. The two ends of the workflow are going to be priced on different logics, with predictable transparency disasters in the middle.

The defensibility floor rises faster than the budget. The most expensive part of collection in 2026 is documenting what was not collected and why. Hyperlinked files that could not be re-pulled. Slack channels where a custodian's permissions were revoked. AI meeting transcripts that turned out to be hosted by a vendor outside the producing party's control. Each of these gets disclosed in a cover letter, justified under proportionality, and increasingly second-guessed by the receiving party in deposition or motion practice. The overhead of documenting collection is becoming a material line item independent of the collection itself, and that overhead does not respond to AI compression.

The longer-term question is whether the collection problem produces a true vendor revolution or remains a labor arbitrage exercise. The pattern in adjacent technology markets — observability, identity, security operations — has been that fragmented, services-heavy categories eventually consolidate around a small number of platform-native solutions. eDiscovery review went through that consolidation between 2010 and 2020. eDiscovery collection has not, and the structural reasons (platform diversity, retention complexity, defensibility overhead) suggest it may not go through it on the same timeline. Collection may stay services-heavy and labor-intensive for the rest of the decade.

For the litigation teams who actually have to do this work, the operational implications are immediate. The 2026 collection budget is not a 17 percent share of the total. It is a 25 percent share that arrives ahead of schedule because the underlying data environment has already moved. The matters that get planned with 2024-vintage cost assumptions are the matters that overrun. The enterprises that have not done the upstream Purview, Vault, and SaaS-connector work are the enterprises that pay the top of the cost variance every time a hold triggers.

The vendor floor at every conference will keep advertising AI document review for another five years. The actual money will be flowing somewhere else. That gap is where most of the bad budgeting decisions in legal tech are going to be made between now and 2030.