Incrementality

Holdout leakage and suppression QA checklist

Published July 3, 2026. Updated July 8, 2026. Status: evergreen source page.

A holdout is only useful if it remains meaningfully unexposed. When control users, households, stores, or markets receive the tested treatment through another path, the readout may still look scientific while the contrast quietly weakens.

Use this checklist before launch, during delivery QA, and before the final readout. It is written for campaign teams, analysts, publishers, agencies, and buyers who need to know whether an incrementality result still represents the question it was designed to answer.

Picture a renewal meeting where the lift report says the campaign produced an 8 percent gain. Before that number goes into the recommendation deck, the team has to answer a simpler question: did the holdout actually stay held out? If control accounts received retargeting, email, or partner delivery, or if their outcomes were matched differently from the exposed cell, the result may be positive without being clean enough for a causal claim.

Figure: Holdout suppression review desk with separate treatment and control packets, checklist records, and warning markers on the control packet. Caption: A holdout QA review starts by separating treatment evidence from control evidence before anyone debates the lift percentage.

Where leakage enters

Leakage is not one failure. It is any route that lets the control cell receive the treatment, a close substitute for the treatment, or a different observation rule than the exposed cell.

Plainly, leakage means the comparison group is no longer a clean picture of what would have happened without the campaign. The leak can happen because the wrong IDs were suppressed, because another channel reached the same people, or because outcomes were easier to see in one cell than the other.
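
As a rough illustration of why partial control exposure weakens the contrast, the arithmetic below blends the treated outcome rate into the control rate. The function name, the rates, and the 25 percent exposed share are hypothetical. The sketch also assumes that exposed control units respond exactly like treated units, which real leakage rarely satisfies.

```python
def observed_difference(baseline_rate, treated_rate, control_exposed_share):
    """Treatment-minus-control outcome rate when part of control is exposed.

    Assumption (illustrative only): exposed control units respond exactly
    like treated units, and every treatment unit is actually exposed.
    """
    if not 0.0 <= control_exposed_share <= 1.0:
        raise ValueError("control_exposed_share must be between 0 and 1")
    # The control rate becomes a blend of unexposed and exposed behavior.
    contaminated_control_rate = (
        (1.0 - control_exposed_share) * baseline_rate
        + control_exposed_share * treated_rate
    )
    return treated_rate - contaminated_control_rate


# Hypothetical rates: 10% baseline conversion, 12% when treated.
clean_gap = observed_difference(0.10, 0.12, 0.0)
leaky_gap = observed_difference(0.10, 0.12, 0.25)
print(f"clean holdout gap: {clean_gap:.3f}")
print(f"gap with 25% of control exposed: {leaky_gap:.3f}")
```

Under these assumptions the measured gap shrinks in proportion to the exposed share of control, so the readout understates the effect while still looking like a valid comparison.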

Figure: Suppression boundary model showing an audience split into treatment and holdout lanes with amber leakage arrows crossing into the protected holdout lane. Caption: The assignment split is only the starting point. Delivery systems, partner audiences, and observation rules can still cross the holdout boundary.

Leakage path	What to inspect	Why it changes the readout
Audience suppression miss	Control IDs, hashed records, device IDs, household IDs, CRM records, or market lists were not excluded from buying, activation, or delivery systems.	The control group receives some tested exposure, shrinking or distorting the measured treatment difference.
Identity graph mismatch	The assigned unit differs from the delivery unit: household assignment but device delivery, account assignment but cookie delivery, or market assignment with cross-border reach.	Exposure can cross the boundary even when the campaign system appears to obey the holdout rule.
Retargeting or audience extension	Control users enter downstream retargeting pools, lookalike expansion, partner audiences, reseller delivery, or sequential messaging.	A clean first exposure rule can be broken by later tactics that were not part of the original test plan.
Other media or sales outreach	Email, search, affiliate, social, field sales, promotions, or account outreach reaches the control cell during the test window.	The result may compare two different marketing mixes rather than treatment versus business as usual.
Outcome observation mismatch	Treatment and control records have different match rates, lead routing, store reporting, survey recruitment, CRM coverage, or conversion windows.	The measured lift may reflect which cell was easier to observe rather than which cell changed behavior.
Post-assignment filtering	Inactive users, unmatched records, low-delivery markets, stockout locations, or outlier accounts are removed after early results are visible.	The comparison can become selected even if the original assignment was randomized.

Pre-launch suppression brief

The suppression brief should be short enough to use, but specific enough that ad operations, analytics, and the buyer can audit the same boundary later.

Field	Write before launch	Evidence to retain
Assigned unit	User, household, account, store, market, ZIP code, device, or another unit, with the reason that unit can be separated.	Assignment file, timestamp, eligibility rule, and owner.
Suppression surface	Every system that must exclude the control cell: ad server, buying platform, CRM, clean room, email tool, sales list, retail platform, or partner activation.	Suppression upload logs, platform receipts, audience counts, and campaign screenshots or exports.
Exposure definition	The treatment that must be blocked from control: specific campaign, creative theme, offer, media channel, package, sales touch, or discount.	Campaign IDs, deal keys, creative IDs, offer IDs, and active dates.
Allowed background activity	Business-as-usual messages or evergreen media that control units may still receive.	Written exception list and reason each exception does not invalidate the decision.
Leakage tolerance	The maximum acceptable control exposure rate, affected market share, or outcome-observation imbalance before the readout is downgraded.	Pre-stated threshold, escalation owner, and readout language rule.
QA cadence	Launch-day check, early delivery check, weekly or mid-flight check, and pre-readout audit.	Dated QA notes with counts, differences, unresolved risks, and owner decisions.
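
One way to make the brief auditable is to store it as a structured record and check it for unwritten fields before launch. The sketch below is a minimal example, not a prescribed schema: the class name, field names, and sample values are all hypothetical and simply mirror the rows of the table above.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class SuppressionBrief:
    """Hypothetical record of the brief; None means the field was never written."""
    assigned_unit: Optional[str] = None
    suppression_surfaces: Optional[List[str]] = None
    exposure_definition: Optional[List[str]] = None  # campaign, creative, or offer IDs
    allowed_background_activity: Optional[List[str]] = None  # [] means "no exceptions"
    leakage_tolerance: Optional[float] = None  # max acceptable control exposure rate
    qa_cadence: Optional[List[str]] = None


def missing_fields(brief: SuppressionBrief) -> List[str]:
    """Return the names of fields that are unwritten or left empty."""
    missing = []
    for f in fields(brief):
        value = getattr(brief, f.name)
        # An explicitly empty exception list is a valid answer; other empties are not.
        explicit_empty_ok = f.name == "allowed_background_activity"
        if value is None or (not explicit_empty_ok and value in ("", [])):
            missing.append(f.name)
    return missing


brief = SuppressionBrief(
    assigned_unit="household",
    suppression_surfaces=["ad_server", "email_tool"],
    exposure_definition=[],  # campaign IDs not yet filled in
    allowed_background_activity=[],
    leakage_tolerance=0.02,
)
print(missing_fields(brief))
```

Treating `None` and an empty list differently for the exception list keeps "nobody wrote this down" distinct from "we decided there are no exceptions".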

In-flight QA checks

In-flight QA should happen before anyone has a result they want to defend. A launch-day check catches missing suppression uploads; a mid-flight check catches audience drift; a pre-readout check decides whether the final language needs to be narrowed.

Launch-day count reconciliation

Compare eligible treatment and control counts against the activation counts in each platform. A sudden drop in one cell, a mismatched geography count, or a missing suppression receipt should be resolved before delivery ramps.
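
The reconciliation can be sketched as a comparison of eligible counts against what each platform reports. Everything named here is an assumption for illustration: the platform names, the counts, and the 2 percent gap threshold are hypothetical, and a real check would read counts from platform exports or upload receipts.

```python
def reconcile_counts(eligible, activated, max_relative_gap=0.02):
    """Flag platforms whose per-cell counts are missing or far from eligible counts.

    eligible:  {cell: count} from the assignment file.
    activated: {platform: {cell: count}} as reported by each platform.
    max_relative_gap is a hypothetical threshold; set it in the suppression brief.
    """
    issues = []
    for platform, counts in activated.items():
        for cell, expected in eligible.items():
            observed = counts.get(cell)
            if observed is None:
                issues.append((platform, cell, "missing count or receipt"))
                continue
            gap = abs(observed - expected) / max(expected, 1)
            if gap > max_relative_gap:
                issues.append(
                    (platform, cell, f"expected {expected}, saw {observed}")
                )
    return issues


eligible = {"treatment": 50000, "control": 10000}
activated = {
    "ad_server": {"treatment": 49800, "control": 10000},
    "email_tool": {"treatment": 50000, "control": 8200},  # suppression list short
    "crm": {"treatment": 50000},  # no control receipt at all
}
issues = reconcile_counts(eligible, activated)
for platform, cell, reason in issues:
    print(platform, cell, reason)
```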

Control exposure audit

Report impressions, reach, clicks, emails, offers, sales touches, or other treatment events observed in the control cell. The audit should use the same unit as assignment whenever possible.
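
A minimal version of the audit joins the assignment file to the exposure log on the assigned unit. The record layout, the household IDs, and the campaign IDs below are hypothetical; the sketch assumes exposure events can already be resolved to the assignment unit, which is often the hard part in practice.

```python
def control_exposure_audit(assignment, exposure_events, treatment_campaign_ids):
    """Return the control units that saw the tested treatment and their share.

    assignment:      {unit_id: "treatment" or "control"}
    exposure_events: iterable of dicts with "unit_id" and "campaign_id" keys
    """
    control_units = {u for u, cell in assignment.items() if cell == "control"}
    exposed = {
        e["unit_id"]
        for e in exposure_events
        if e["unit_id"] in control_units
        and e["campaign_id"] in treatment_campaign_ids
    }
    rate = len(exposed) / len(control_units) if control_units else 0.0
    return exposed, rate


assignment = {
    "h1": "control", "h2": "control", "h3": "control", "h4": "control",
    "h5": "treatment", "h6": "treatment",
}
events = [
    {"unit_id": "h1", "campaign_id": "CAMP-A"},
    {"unit_id": "h1", "campaign_id": "CAMP-A"},  # repeat events count once per unit
    {"unit_id": "h2", "campaign_id": "CAMP-Z"},  # not the tested campaign
    {"unit_id": "h5", "campaign_id": "CAMP-A"},  # treatment cell, expected
]
exposed_units, exposure_rate = control_exposure_audit(assignment, events, {"CAMP-A"})
print(sorted(exposed_units), exposure_rate)
```

Counting distinct units rather than events keeps the rate comparable to a leakage tolerance stated as a share of the control cell.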

Overlap with other tactics

Check whether control units are entering retargeting pools, search remarketing lists, lead-nurture flows, loyalty offers, partner audiences, or regional promotions during the test window.
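
If each downstream pool can be exported as a list of unit IDs, the overlap check reduces to set intersections. The pool names and IDs in this sketch are hypothetical, and it assumes the pools use the same identifier as the assignment file.

```python
def control_overlap(control_units, pools):
    """Share of control units found in each pool, plus the share in any pool.

    control_units: set of unit IDs assigned to control.
    pools:         {pool_name: set of unit IDs currently in that pool}
    """
    if not control_units:
        raise ValueError("control_units is empty")
    shares = {}
    touched = set()
    for name, members in pools.items():
        overlap = control_units & members
        touched |= overlap
        shares[name] = len(overlap) / len(control_units)
    shares["any_pool"] = len(touched) / len(control_units)
    return shares


control = {"a1", "a2", "a3", "a4", "a5"}
pools = {
    "retargeting": {"a1", "a2", "x9"},
    "lead_nurture": {"a2"},
    "loyalty_offer": set(),
}
overlap_shares = control_overlap(control, pools)
print(overlap_shares)
```

The `any_pool` share avoids double counting units that sit in several pools at once.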

Delivery imbalance

Treatment delivery should be checked against the plan by geography, device, audience, placement, creative, and date. A test with weak treatment exposure may be inconclusive for operational reasons.

Observation parity

Compare match rates, survey recruitment rates, form processing, lead routing, store coverage, CRM sync timing, and outcome lag across cells. Unequal observation can create or hide lift.
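
For match rates, the parity check can be as simple as comparing the share of assigned units with an observable outcome in each cell. The counts and the 5-point gap threshold below are hypothetical; the same shape applies to survey recruitment or CRM coverage if those are counted per cell.

```python
def match_rate_gap(assigned, matched, max_gap=0.05):
    """Compare outcome match rates across cells.

    assigned: {cell: units assigned}
    matched:  {cell: units whose outcomes could be observed}
    max_gap is a hypothetical tolerance in absolute rate points.
    """
    rates = {cell: matched[cell] / assigned[cell] for cell in assigned}
    gap = rates["treatment"] - rates["control"]
    return rates, gap, abs(gap) > max_gap


assigned = {"treatment": 50000, "control": 10000}
matched = {"treatment": 36000, "control": 6400}
rates, gap, flagged = match_rate_gap(assigned, matched)
print(rates, round(gap, 3), flagged)
```

A gap in this direction means treatment outcomes are easier to see than control outcomes, which can create apparent lift with no change in behavior.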

Figure: In-flight holdout QA workflow with count, exposure, overlap, and observation checks leading to pass, revise, or hold decision trays. Caption: In-flight QA turns scattered operating checks into a decision path: pass the readout, revise the setup, or hold causal language.

Leakage severity score

The goal is not to pretend every test can be perfectly isolated. The goal is to decide whether the readout still deserves causal language or should be downgraded to operational evidence.

Severity	Pattern	How to use the result
Low	Minor control exposure, symmetric outcome capture, documented cause, and no material change to the treatment-control contrast.	Keep causal language bounded to the tested population, and disclose the leakage check and sensitivity read.
Moderate	Noticeable control exposure, partial suppression failure, or a material overlap with another tactic that can be estimated.	Use directional language, show sensitivity ranges, and avoid confident scaling claims until a cleaner test confirms the result.
High	Control group received substantial treatment or a close substitute, or one cell had meaningfully different outcome capture.	Do not use the estimate as causal evidence. Treat the report as delivery and operations learning.
Unknown	Suppression logs, exposure audits, or observation parity checks are missing.	Downgrade the conclusion because the control condition cannot be verified.
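
The table can be turned into a simple rule once thresholds are pre-stated. The sketch below is one possible encoding, not a standard: the function name and the cutoffs for "substantial" exposure and "meaningfully different" outcome capture are hypothetical and should come from the suppression brief rather than from this example.

```python
def leakage_severity(
    evidence_complete,
    control_exposure_rate,
    observation_gap,
    tolerance,
    high_exposure_rate=0.20,   # hypothetical cutoff for "substantial" exposure
    max_observation_gap=0.05,  # hypothetical cutoff for unequal outcome capture
):
    """Map QA measurements to low, moderate, high, or unknown severity."""
    # Missing logs or audits mean the control condition cannot be verified.
    if (
        not evidence_complete
        or control_exposure_rate is None
        or observation_gap is None
    ):
        return "unknown"
    if (
        control_exposure_rate >= high_exposure_rate
        or abs(observation_gap) > max_observation_gap
    ):
        return "high"
    if control_exposure_rate > tolerance:
        return "moderate"
    return "low"


print(leakage_severity(True, 0.01, 0.01, tolerance=0.02))
print(leakage_severity(True, 0.08, 0.01, tolerance=0.02))
print(leakage_severity(True, 0.30, 0.00, tolerance=0.02))
print(leakage_severity(False, 0.01, 0.00, tolerance=0.02))
```

Checking for missing evidence first mirrors the Unknown row: a low measured exposure rate means little if the audit behind it cannot be shown.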

Response map when leakage appears

The response should be chosen before the result is visible whenever possible. That prevents teams from treating a positive result as clean just because the business would like to use it.

Leakage signal	Decision rule	Reader-first next action
Control exposure exists, but remains below the pre-stated tolerance and does not change sensitivity reads.	Keep the result, but disclose the check and keep the claim inside the tested population and window.	Add a short leakage note to the readout and keep the exact campaign IDs, dates, and suppression receipts attached.
Suppression failed in one activation surface, but the affected unit count is measurable.	Re-run the estimate with the affected control units removed or flagged, then compare both versions before deciding.	Show the original result, the sensitivity result, and the operational fix needed before repeating the tactic.
Another marketing tactic reached a material share of control units during the test window.	Downgrade from causal language to directional evidence unless the overlap can be isolated convincingly.	Name the overlapping tactic, show its timing, and avoid attributing all lift to the tested campaign.
Outcome capture is different across cells: match rates, survey response, lead routing, or store coverage diverge.	Do not treat the result as behavioral lift until observation parity is repaired or bounded with a sensitivity read.	Separate measurement-system learning from campaign-effect learning in the recommendation.
Suppression logs or exposure audits are missing.	Classify the readout as an unverified comparison, even if the original test plan looked randomized.	Use the report for operations and planning, not as proof that the campaign caused the outcome.
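
For the case where one activation surface failed and the affected units are known, the re-run can be sketched as a difference in mean outcomes with and without those units. The unit IDs and outcomes are hypothetical. Dropping exposed control units is itself a form of post-assignment filtering, so the second number is a sensitivity read to show beside the original, not a corrected estimate.

```python
def mean_difference(outcomes, assignment, excluded=frozenset()):
    """Treatment mean minus control mean, optionally excluding flagged units.

    outcomes:   {unit_id: numeric outcome}
    assignment: {unit_id: "treatment" or "control"}
    """
    totals = {"treatment": [0.0, 0], "control": [0.0, 0]}
    for unit, cell in assignment.items():
        if unit in excluded:
            continue
        totals[cell][0] += outcomes[unit]
        totals[cell][1] += 1
    for cell, (_, n) in totals.items():
        if n == 0:
            raise ValueError(f"no units left in {cell} cell")
    means = {cell: total / n for cell, (total, n) in totals.items()}
    return means["treatment"] - means["control"]


assignment = {
    "t1": "treatment", "t2": "treatment", "t3": "treatment", "t4": "treatment",
    "c1": "control", "c2": "control", "c3": "control", "c4": "control",
}
outcomes = {"t1": 1, "t2": 1, "t3": 0, "t4": 1, "c1": 1, "c2": 0, "c3": 0, "c4": 1}
affected_control = {"c1"}  # hypothetical: reached through the failed surface

original = mean_difference(outcomes, assignment)
sensitivity = mean_difference(outcomes, assignment, excluded=affected_control)
print(f"original: {original:.3f}, affected units removed: {sensitivity:.3f}")
```

If the two versions point to different decisions, that disagreement is the finding to report.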

Readout language by QA result

QA finding	Careful wording	Wording to avoid
Suppression clean, exposure strong, outcome capture balanced.	The test estimates incremental impact for this eligible population, treatment, and readout window.	The channel caused the same lift everywhere.
Small control leakage, sensitivity still clears the decision threshold.	The result remains supportive under the documented leakage sensitivity, but should stay bounded to the tested setup.	Leakage was immaterial, so the number is exact.
Moderate leakage or overlapping tactics.	The result is directional because the holdout was partially exposed or affected by related activity.	The positive lift proves the treatment worked.
Outcome observation differs by cell.	The readout cannot separate behavioral lift from differences in matching, routing, survey response, or data capture.	The measured conversions prove incremental impact.
Suppression evidence missing.	The report describes observed performance under an unverified comparison, not a clean causal estimate.	The test design guarantees incrementality.

Figure: Side-by-side readout boards contrasting an overclaimed lift result with leakage warnings against a bounded recommendation supported by suppression evidence. Caption: When leakage is material or unverified, the readout should shrink the claim instead of using a scientific-looking number to overstate certainty.

Questions for the readout call

Which systems received the suppression list, and can each system show a dated upload, audience count, or exclusion log?
Was the assignment unit the same as the delivery and outcome unit? If not, what cross-unit leakage was expected?
How many control units received impressions, emails, offers, calls, retargeting, or related exposure during the window?
Were any users, markets, accounts, stores, leads, or outcomes removed after assignment?
Did treatment and control have comparable match rates, survey recruitment, lead routing, store coverage, and conversion lag?
What leakage threshold would have downgraded the conclusion before the result was visible?
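
The question about units removed after assignment can often be answered mechanically if the original assignment file and the final analysis file are both retained. The sketch below compares the two by cell; the unit IDs are hypothetical, and it assumes both files share the same identifier.

```python
def removed_after_assignment(assignment, analyzed_units):
    """Count assigned units that are absent from the analysis file, by cell.

    assignment:     {unit_id: cell name} from the original assignment file.
    analyzed_units: set of unit IDs present in the final analysis file.
    """
    summary = {}
    for unit, cell in assignment.items():
        row = summary.setdefault(cell, {"assigned": 0, "removed": 0})
        row["assigned"] += 1
        if unit not in analyzed_units:
            row["removed"] += 1
    for row in summary.values():
        row["removed_share"] = row["removed"] / row["assigned"]
    return summary


assignment = {
    "t1": "treatment", "t2": "treatment", "t3": "treatment", "t4": "treatment",
    "c1": "control", "c2": "control", "c3": "control", "c4": "control",
}
analyzed = {"t1", "t2", "t3", "t4", "c1", "c2"}
removal_summary = removed_after_assignment(assignment, analyzed)
print(removal_summary)
```

Unequal removal shares across cells are a prompt to ask when and why the units were dropped, since removal after early results are visible can turn a randomized comparison into a selected one.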

Takeaway

A holdout is not protected by intention. It is protected by visible suppression, exposure audits, stable assignment, and equal outcome capture. When those checks are present, a lift result can support a bounded decision. When they are absent, the most honest readout may be that the campaign delivered, but the counterfactual was not clean enough to prove what changed.
