Incrementality

Comparison market and holdout planning guide

Published July 3, 2026. Updated July 3, 2026. Status: evergreen source page.

Most advertising measurement claims depend on one quiet question: compared with what? A campaign can look persuasive when the comparison group was easier to convert, exposed through another route, already trending differently, or chosen after the result was visible.

This guide helps a team choose the right comparison before launch, protect it during the campaign, and write a readout that does not claim more than the design can support.

Figure: planning-board illustration showing matched markets, a protected holdout, similar pre-period trends, campaign start timing, and leakage checks before a lift claim.

A comparison design should pass three checks before the readout uses lift language: pre-period balance, exposure separation, and outcome-window discipline. A visible failure in any lane changes the strongest allowable claim.

Choose the comparison type

Design	Use when	Main risk	Minimum protection
User or household holdout	Exposure can be suppressed for eligible users, households, accounts, or devices.	Identity gaps, cross-device exposure, or other campaigns reaching the control group.	Fixed eligibility, suppression audit, exposure leakage check, and one primary outcome window.
Geo or store holdout	Media, retail activity, pricing, or operations can be changed by market, store, or region.	Markets differ in trend, seasonality, distribution, or competitor pressure.	Pre-period trend checks, matched controls, logged local events, and market-level readout.
Matched-market comparison	Random assignment is not practical, but untreated markets can be chosen before launch.	Convenient controls that match on size while missing the business trend.	Matching locked before results, placebo period checks, and sensitivity analysis.
Business-as-usual benchmark	No clean holdout exists and the decision can tolerate weaker evidence.	Seasonality, pricing, distribution, or demand changes masquerade as campaign impact.	Plain language that labels the result directional rather than causal.

The pre-launch planning sheet

The planning sheet should be written before the campaign team can see the answer. It does not need to be long. It needs to remove avoidable judgment calls from the readout.

Decision

Name the action the result will inform: renew a package, scale spend, change audience, adjust bids, approve a market rollout, or pause the tactic.

Eligible population

Define who can enter treatment or control before exposure begins. Include geography, customer status, product availability, audience rules, and any exclusions.

Comparison rule

Explain how the control group, holdout, or matched market is chosen. The rule should not depend on which option later produces the strongest lift.

Primary outcome

Choose one outcome with a source of truth, reporting lag, and fixed readout window. Secondary metrics can stay in the report, but label them as supporting evidence rather than as the basis for the decision.

Downgrade triggers

List events that would weaken the claim: suppression failure, budget underdelivery, tracking changes, stockouts, price changes, overlapping launches, or large unplanned promotions.
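One way to keep the sheet from drifting is to store it as a read-only record and check it for blanks before launch. The sketch below is illustrative only: the class name `PlanningSheet`, its field names, and the helper `locked_before_launch` are assumptions made for this example, not part of any tool this guide depends on.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)  # frozen: fields cannot be reassigned after creation
class PlanningSheet:
    """Hypothetical pre-launch record; fields mirror the sections above."""

    decision: str
    eligible_population: str
    comparison_rule: str
    primary_outcome: str
    outcome_window_days: int
    downgrade_triggers: tuple[str, ...]
    locked_on: date

    def missing_fields(self) -> list[str]:
        """Return the names of fields that are blank or unusable."""
        text_fields = (
            "decision",
            "eligible_population",
            "comparison_rule",
            "primary_outcome",
        )
        missing = [name for name in text_fields if not getattr(self, name).strip()]
        if self.outcome_window_days <= 0:
            missing.append("outcome_window_days")
        if not self.downgrade_triggers:
            missing.append("downgrade_triggers")
        return missing


def locked_before_launch(sheet: PlanningSheet, launch_date: date) -> bool:
    """True only if the sheet was locked strictly before the launch date."""
    return sheet.locked_on < launch_date
```

A team could refuse to start the campaign until `missing_fields()` returns an empty list and `locked_before_launch` is true. The frozen dataclass blocks casual reassignment in code, but it is not a substitute for a dated, shared copy of the plan.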

Score the comparison before launch

A fast comparison score keeps the team from discovering the evidence standard only after the readout is persuasive. It is not a statistical model. It is a pre-launch decision rule for deciding whether the campaign can support lift language, directional language, or only operational reporting.

Factor	Green-light condition	Directional-only condition	Downgrade trigger
Assignment rule	Treatment and control are assigned or matched before launch and cannot be swapped after results are visible.	Controls are pre-selected, but the rule depends on judgment rather than randomization.	Controls are replaced, trimmed, or explained away after the first result is known.
Pre-period trend	Treatment and control move similarly across the same business cycle before launch.	Levels match, but one side is already accelerating or decelerating.	The campaign group was already improving faster before media changed.
Exposure separation	Control units have a documented suppression path and a leakage audit.	Some spillover is possible, but the team can estimate where it happened.	Control users, stores, or markets receive meaningful treatment through another route.
Outcome source	The primary outcome has a stable definition, source of truth, and reporting lag.	The outcome is useful but depends on matching, modeled status, or delayed closeout.	The outcome definition changes, excludes weak rows, or arrives from a different system mid-test.
Operating context	Price, distribution, sales coverage, inventory, and local events are logged and materially similar.	One side has known changes that can be disclosed and bounded.	A major promotion, stockout, sales push, or local event explains the observed movement.

If two or more factors are directional-only, write the test plan as a monitored comparison unless the team can redesign assignment, extend the pre-period, or narrow the decision. If any downgrade trigger already looks likely before launch, the strongest honest result is usually a learning readout, not a causal lift claim.
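The decision rule above can be written down as a small function so the plan type is settled mechanically before launch. This is a minimal sketch, not a statistical model: the names `Rating`, `FACTORS`, and `plan_type` are hypothetical, and treating a single directional-only factor as still eligible for a lift test is an assumption, since the rule only addresses two or more.

```python
from __future__ import annotations

from enum import Enum


class Rating(Enum):
    GREEN = "green"
    DIRECTIONAL = "directional"
    DOWNGRADE = "downgrade"


# The five factors from the scoring table, as hypothetical keys.
FACTORS = (
    "assignment_rule",
    "pre_period_trend",
    "exposure_separation",
    "outcome_source",
    "operating_context",
)


def plan_type(ratings: dict[str, Rating]) -> str:
    """Map pre-launch factor ratings to the strongest plan the design supports."""
    unrated = [name for name in FACTORS if name not in ratings]
    if unrated:
        raise ValueError(f"unrated factors: {unrated}")

    values = [ratings[name] for name in FACTORS]
    # Any likely downgrade trigger caps the plan at a learning readout.
    if Rating.DOWNGRADE in values:
        return "learning readout"
    # Two or more directional-only factors make it a monitored comparison.
    if values.count(Rating.DIRECTIONAL) >= 2:
        return "monitored comparison"
    # Assumption: zero or one directional-only factor still allows a lift test.
    return "lift test"
```

For example, a plan rated green on every factor except a directional-only pre-period trend and a directional-only outcome source would come back as a monitored comparison. The function ignores the escape hatches in the rule (redesigning assignment, extending the pre-period, narrowing the decision); those change the ratings, so the team would re-score after the repair.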

Balance checks that matter

Check	What to compare	Why it matters
Starting level	Revenue, conversions, traffic, account base, distribution, store count, or eligible customers before launch.	A much larger or smaller control can make normal movement look like lift.
Pre-period trend	Daily or weekly movement before treatment, including seasonality and promotion cycles.	A treatment group already accelerating before launch is not a neutral comparison.
Outcome volatility	Historic swings, outlier weeks, small-market instability, and high-value transactions.	A noisy comparison can produce confident-looking but fragile readouts.
Business drivers	Price, inventory, retail distribution, sales coverage, product launches, local events, and competitor pressure.	Untracked operating changes can become hidden causes in the readout.
Exposure separation	Suppression logs, geography boundaries, audience extension, frequency overlap, and partner delivery.	A control group that receives the treatment no longer estimates what would have happened without it.
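The first three checks in the table can be summarized from pre-period data with a few lines of arithmetic. The sketch below assumes each side is a list of weekly totals for the primary outcome over the same pre-period weeks; the function names and the output keys are hypothetical, and the code deliberately sets no pass or fail thresholds because this guide does not prescribe any.

```python
from __future__ import annotations

from statistics import mean, pstdev


def weekly_slope(series: list[float]) -> float:
    """Least-squares slope of the series against week index 0, 1, 2, ..."""
    n = len(series)
    if n < 2:
        raise ValueError("need at least two pre-period weeks")
    x_bar = (n - 1) / 2
    y_bar = mean(series)
    numerator = sum((i - x_bar) * (y - y_bar) for i, y in enumerate(series))
    denominator = sum((i - x_bar) ** 2 for i in range(n))
    return numerator / denominator


def balance_summary(treatment: list[float], control: list[float]) -> dict[str, float]:
    """Compare starting level, pre-period trend, and volatility for two groups."""
    if len(treatment) != len(control):
        raise ValueError("both sides must cover the same pre-period weeks")
    t_mean, c_mean = mean(treatment), mean(control)
    return {
        # Starting level: treatment average relative to control average.
        "level_ratio": t_mean / c_mean,
        # Pre-period trend: weekly growth as a share of the average level,
        # treatment minus control. Positive means treatment was already
        # growing faster before launch.
        "trend_gap": weekly_slope(treatment) / t_mean - weekly_slope(control) / c_mean,
        # Outcome volatility: coefficient of variation on each side.
        "treatment_cv": pstdev(treatment) / t_mean,
        "control_cv": pstdev(control) / c_mean,
    }
```

A positive `trend_gap` is the pattern the table warns about: a treatment group that was already accelerating before launch. Business drivers and exposure separation cannot be read off the outcome series alone, so they still need the logs described in the table.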

Worked example: when to downgrade the claim

A regional retailer wants to test extra media in eight treatment markets and compare them with eight matched markets. The markets are similar in starting revenue and store count, but the treatment markets had a distribution expansion two weeks before launch and were already trending upward. The plan can still teach the team something, but it should not be sold as a clean causal test unless the design is repaired before launch.

Finding before launch	Why it weakens the readout	Better move	Allowed language if unchanged
Treatment markets already outpaced controls in the pre-period.	The post-period difference may extend an existing trend rather than measure media impact.	Rematch on trend, add more pre-period weeks, or choose a different treatment set.	Observed markets outperformed selected controls during the campaign window.
One side received new product distribution before the campaign.	Availability can create revenue movement without any advertising effect.	Exclude affected markets, wait for distribution to stabilize, or make distribution a documented covariate.	The result is directional because operations changed near launch.
Control markets may receive reseller, social, or national media exposure.	Leakage compresses the difference and makes the control group a partial treatment group.	Audit delivery logs by market and write a leakage threshold before launch.	The comparison estimates performance under partial spillover, not a clean no-media counterfactual.
The readout goal is a national budget decision.	Eight markets may not represent national seasonality, distribution, or competitive pressure.	Frame the result as a market-level learning test and plan a broader follow-up if the signal holds.	This market set supports a next-test decision, not national proof.

Protect the control during the campaign

Keep treatment and control eligibility fixed unless a pre-stated rule says otherwise.
Log delivery by the same units used for assignment: user, household, store, market, account, or region.
Check that control units were not reached by audience extension, reseller media, retargeting, sales outreach, or overlapping promotions.
Record budget underdelivery, pacing changes, creative substitutions, tracking outages, product availability issues, and price changes as they happen.
Do not replace weak-looking controls after the campaign starts unless the readout is explicitly downgraded.
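If delivery is logged by the same units used for assignment, a leakage check can be a simple aggregation. The sketch below assumes a delivery log of `(unit_id, impressions)` rows and a known set of control unit IDs; the function name, the row layout, and the output keys are assumptions for illustration, and real logs from a partner or platform will need their own parsing.

```python
from __future__ import annotations

from collections import defaultdict
from typing import Iterable


def leakage_report(
    delivery_log: Iterable[tuple[str, int]],
    control_units: set[str],
) -> dict[str, object]:
    """Summarize how much logged delivery landed on control units."""
    impressions_by_unit: dict[str, int] = defaultdict(int)
    for unit_id, impressions in delivery_log:
        impressions_by_unit[unit_id] += impressions

    total = sum(impressions_by_unit.values())
    # Control units that received any delivery at all.
    leaked = {
        unit: count
        for unit, count in impressions_by_unit.items()
        if unit in control_units and count > 0
    }
    return {
        "control_units_reached": sorted(leaked),
        "share_of_control_units_reached": (
            len(leaked) / len(control_units) if control_units else 0.0
        ),
        "share_of_impressions_in_control": (
            sum(leaked.values()) / total if total else 0.0
        ),
    }
```

Running this on a schedule during the campaign, rather than once at the end, gives the team a dated record of when leakage started. The report only covers routes that appear in the log, so reseller media, sales outreach, and overlapping promotions still need their own checks.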

Readout language by evidence strength

Design condition	Stronger language	Language to avoid
Clean randomized holdout, limited leakage, clear outcome window.	The test estimates incremental effect for this eligible population and campaign setup.	The channel always drives this lift.
Matched markets with strong pre-period trend similarity and logged controls.	The tested markets outperformed a pre-selected comparison estimate under these conditions.	The campaign proved national incrementality.
Control contamination or major operating changes.	The result is directional because the comparison no longer cleanly represents business as usual.	The positive result confirms the tactic.
No holdout, only before-and-after reporting.	The report shows observed movement after launch, not a causal estimate of lift.	The campaign caused the full change from the prior period.
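A draft readout can be screened for overclaiming before it circulates. The sketch below is a rough text check, not a review process: the design keys, the term lists in `AVOID_TERMS`, and the function name are all assumptions loosely derived from the "Language to avoid" column, and a team would tune them to its own reports.

```python
from __future__ import annotations

# Hypothetical overclaim markers per design condition, loosely based on the
# "Language to avoid" column. These lists are illustrative, not exhaustive.
AVOID_TERMS: dict[str, tuple[str, ...]] = {
    "randomized_holdout": ("always",),
    "matched_markets": ("proved", "national incrementality"),
    "contaminated": ("confirms",),
    "before_after_only": ("caused",),
}


def flag_overclaims(readout: str, design: str) -> list[str]:
    """Return sentences in the readout that contain a term to avoid."""
    terms = AVOID_TERMS[design]  # raises KeyError for an unknown design
    # Naive sentence split on periods; enough for a first-pass screen.
    sentences = [part.strip() for part in readout.split(".") if part.strip()]
    return [s for s in sentences if any(term in s.lower() for term in terms)]
```

Simple substring matching will miss paraphrases and can flag harmless uses of a word, so a flagged sentence is a prompt for a human edit rather than an automatic rejection.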

Useful questions for a vendor call

Who was eligible for treatment and control before the campaign began?
What exact rule assigned or matched the comparison group?
How much treatment exposure reached the control group?
Which events, markets, users, or outcomes were excluded, and were those rules set before results were visible?
What result would have been called inconclusive even if the point estimate had been positive?

Takeaway

A comparison group is not a formality. It is the claim. If the holdout or matched market does not represent what would have happened anyway, the readout may still be useful as operational reporting, but it should not be treated as strong causal evidence.
