Campaign baseline comparison checklist

Published July 3, 2026. Updated July 3, 2026. Status: evergreen measurement reference.

A campaign result is only as useful as the comparison underneath it. Before a report says a campaign drove, lifted, improved, or outperformed, the reader needs to know what baseline it is being compared against.

Use this checklist when a campaign readout, vendor deck, dashboard, or renewal memo turns observed performance into a stronger claim. The goal is to separate clean descriptive reporting from directional evidence and from causal lift language.

Figure: editorial baseline comparison review board showing one campaign outcome checked against trend fit, audience matching, tracking alignment, and holdout protection. Baseline review starts with the headline campaign result, then asks whether the prior period, matched context, or holdout lane is strong enough for the claim. The figure shows why trend fit, mix alignment, tracking parity, and exposure protection need to be visible before observed response becomes lift language.

Baseline strength ladder

Not every decision needs a randomized test, but every decision needs a visible comparison. Start by naming the baseline type before debating the conclusion.

Comparison type	Useful for	Main weakness	Careful claim
No explicit baseline	Confirming delivery, tracking, and observed response.	No way to separate campaign response from normal demand.	The campaign delivered measured activity under the stated tracking rule.
Prior-period baseline	Fast readouts when seasonality, pricing, traffic mix, and tracking are stable.	The prior period may be a convenient chart choice rather than a fair counterfactual.	Observed outcomes were higher or lower than the selected prior period.
Matched context or audience baseline	Directional comparisons across similar pages, packages, audiences, or placements.	Matched groups can still differ in intent, reachability, creative, or destination quality.	The result is directionally stronger than the matched comparison, subject to matching limits.
Matched market or geo baseline	Market-level launches where user-level holdouts are impractical.	Pre-period trend, local seasonality, and concurrent changes can explain the gap.	The treatment markets outperformed the selected comparison markets within this design.
Protected holdout	Estimating incremental effect when assignment and suppression can be defended.	Leakage, noncompliance, underpowered cells, or narrow samples can weaken generalization.	The campaign produced measured lift for this audience, window, and uncertainty range.
Model baseline calibrated with experiments	Budget planning across channels, time, and business drivers.	Model assumptions and calibration quality determine how causal the estimate deserves to sound.	Multiple evidence streams support a bounded planning range.
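
To make the "uncertainty range" in the protected holdout row concrete, here is a minimal sketch of one common way to report it: the difference in conversion rates between exposed and holdout groups with a normal-approximation interval. The function name, the counts, and the 1.96 multiplier (roughly a 95% interval) are illustrative assumptions, not a method prescribed by this checklist, and the approximation assumes reasonably large cells.

```python
from math import sqrt


def lift_with_interval(conv_t, n_t, conv_c, n_c, z=1.96):
    """Absolute difference in conversion rate (treatment minus holdout)
    with a normal-approximation interval. Hypothetical helper."""
    p_t = conv_t / n_t  # treatment conversion rate
    p_c = conv_c / n_c  # holdout conversion rate
    diff = p_t - p_c
    # Standard error of the difference between two independent proportions
    se = sqrt(p_t * (1 - p_t) / n_t + p_c * (1 - p_c) / n_c)
    return diff, diff - z * se, diff + z * se


# Made-up counts: 540 of 10,000 exposed users converted, 500 of 10,000 held out
diff, low, high = lift_with_interval(540, 10_000, 500, 10_000)
print(f"lift: {diff:.4f}, interval: [{low:.4f}, {high:.4f}]")
```

With these made-up counts the interval spans zero, so the careful claim would state the range rather than the point estimate alone.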

Build the comparison packet

Ask for the packet before accepting the baseline. A weak comparison often looks strong because the report hides the fields that would make it testable.

Decision and claim

The report should state the budget, renewal, creative, package, or test decision the result is meant to inform, plus the exact sentence the evidence is supposed to support.

Primary outcome

Name the preselected outcome, denominator, event rule, conversion window, lead-quality filter, and reporting cutoff. If the primary outcome changed after results were visible, downgrade the claim.

Eligible population

Show who or what could enter the campaign and the comparison: users, households, markets, stores, pages, placements, devices, leads, or matched outcomes.

Pre-period evidence

Show levels and trends before launch. A matched group is weak if it only matches on size while demand, traffic quality, or outcome mix was moving differently.

Concurrent changes

List pricing, promotions, sales coverage, site changes, creative changes, inventory shifts, news-cycle effects, and other campaigns that changed during the readout window.

Exposure isolation

For holdouts or matched groups, show suppression logs, audience overlap, control exposure, channel leakage, and whether excluded users could still see the campaign elsewhere.

Baseline fit test

Use these checks before treating a comparison as fair enough for the decision.

Fit check	Ask for	Downgrade when
Level alignment	Pre-period outcome levels by treatment and comparison group.	The baseline group starts far above or below treatment without adjustment or explanation.
Trend alignment	Pre-period slopes across several comparable windows.	The groups were already diverging before the campaign started.
Mix alignment	Device, geography, audience, placement, creative, product, and destination mix.	The treatment group received a materially easier or harder mix.
Opportunity alignment	Eligibility, reachability, inventory availability, frequency, and bid conditions.	The comparison group had less opportunity to be reached or convert.
Tracking alignment	Tag status, event rules, attribution windows, match rates, and data-lag cutoffs.	One side has better measurement coverage than the other.
Outcome maturity	Conversion lag, sales follow-up status, brand-study field dates, and late-arriving outcomes.	The readout compares mature outcomes with immature outcomes.
Decision threshold	The minimum effect, cost, quality, or uncertainty threshold that would change the decision.	The report celebrates a difference too small or too uncertain to act on.
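
The level and trend checks above can be sketched in a few lines. This is an illustration under stated assumptions: the weekly values are made up, the function names are hypothetical, and the 10% level tolerance and 2% per-period trend tolerance are placeholders that a team would need to set for its own decision.

```python
from statistics import mean


def ols_slope(values):
    """Least-squares slope of values against their index (0, 1, 2, ...)."""
    n = len(values)
    x_bar = (n - 1) / 2
    y_bar = mean(values)
    num = sum((i - x_bar) * (y - y_bar) for i, y in enumerate(values))
    den = sum((i - x_bar) ** 2 for i in range(n))
    return num / den


def pre_period_fit(treatment, comparison, level_tol=0.10, trend_tol=0.02):
    """Compare pre-period level and trend. Tolerances are hypothetical."""
    t_level, c_level = mean(treatment), mean(comparison)
    level_gap = abs(t_level - c_level) / c_level
    # Express each slope as fractional change per period so groups of
    # different size can be compared on the same scale.
    t_trend = ols_slope(treatment) / t_level
    c_trend = ols_slope(comparison) / c_level
    trend_gap = abs(t_trend - c_trend)
    return {
        "level_gap": level_gap,
        "trend_gap": trend_gap,
        "level_ok": level_gap <= level_tol,
        "trend_ok": trend_gap <= trend_tol,
    }


# Made-up weekly conversions for six pre-launch weeks
treatment = [100, 104, 109, 113, 118, 122]   # already rising
comparison = [108, 109, 107, 108, 109, 108]  # roughly flat
result = pre_period_fit(treatment, comparison)
print(result)
```

In this example the groups sit at similar levels but the treatment series was already climbing before launch, which is the downgrade condition in the trend alignment row.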

Worked baseline-fit score

When the report has a plausible baseline but not a clean experiment, score the comparison before writing the conclusion. The score is not a statistical test. It is a decision guardrail that keeps the language from becoming stronger than the packet.

Gate	What the packet shows	Score	Claim effect
Pre-period level and trend	Treatment markets were already rising faster for six weeks, while the comparison markets were flat.	0 / 2	Do not use the full post-launch gap as campaign lift.
Mix and opportunity	The treatment group had more high-intent placements and a cleaner landing path during the readout window.	1 / 2	Keep the result directional unless the readout adjusts for easier opportunity.
Tracking and maturity	Tags, match rates, event rules, and conversion-lag cutoffs were similar across both sides.	2 / 2	Tracking parity does not rescue the baseline, but it removes one alternative explanation.
Concurrent changes	A regional promotion and distribution change overlapped the campaign in several treatment markets.	0 / 2	Separate campaign response from timing, promotion, and availability effects.
Decision threshold	The adjusted estimate is 1.5 percentage points against a prewritten 4 percentage point renewal hurdle.	1 / 2	Use the result for test planning, not as proof for scaling or renewal.

This packet scores 4 out of 10 (0 + 1 + 2 + 0 + 1). A defensible memo can say the campaign coincided with stronger observed response in selected markets, but the comparison is too weak to support lift or return on ad spend (ROAS) claims. The next decision should be a repaired matched-market plan, a protected holdout where possible, or a smaller optimization that does not need causal language.
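
The tally for the worked packet can be reproduced with a short sketch. The gate names and scores come from the table above; the function name and the choice to list zero-scored gates separately are assumptions made for illustration, since the checklist defines no cutoff that maps a total to a claim level.

```python
MAX_PER_GATE = 2

# Gate scores copied from the worked table
gates = {
    "Pre-period level and trend": 0,
    "Mix and opportunity": 1,
    "Tracking and maturity": 2,
    "Concurrent changes": 0,
    "Decision threshold": 1,
}


def tally(gate_scores, max_per_gate=MAX_PER_GATE):
    """Return (total, maximum, zero_gates) for a scored packet.

    Hypothetical helper: it only sums the scores and surfaces the gates
    that scored zero. It does not decide which claim is allowed.
    """
    for name, score in gate_scores.items():
        if not 0 <= score <= max_per_gate:
            raise ValueError(f"{name}: score {score} outside 0..{max_per_gate}")
    total = sum(gate_scores.values())
    maximum = max_per_gate * len(gate_scores)
    zero_gates = [name for name, score in gate_scores.items() if score == 0]
    return total, maximum, zero_gates


total, maximum, zero_gates = tally(gates)
print(f"{total} / {maximum}; zero-scored gates: {zero_gates}")
```

Listing the zero-scored gates next to the total keeps the memo focused on what must be repaired, rather than on the number alone.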

Language by evidence level

The same table can support very different wording depending on the baseline. Keep the conclusion inside the comparison.

Evidence in hand	Use this language	Avoid this language
Tracked response only	The campaign produced observed visits, leads, or matched outcomes under the stated tracking rule.	The campaign created new demand.
Prior-period comparison	Observed outcomes were higher than the selected prior period, with these known context changes.	The campaign caused the full before-and-after increase.
Matched baseline	The campaign outperformed the matched comparison on this outcome, with remaining mix and selection limits.	The matched result proves incremental lift.
Protected holdout with leakage checks	The campaign produced measured lift within the design, window, sample, and uncertainty range.	The result will generalize to every future campaign.
Model plus calibration evidence	The model and calibration evidence support this planning direction, with stated assumptions and sensitivity.	The model has proven exact channel contribution.
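
A draft sentence can be screened against its evidence level before it reaches a memo. The sketch below is a rough aid, not a substitute for review: the phrase lists are assumptions loosely derived from the "Avoid this language" column, the function name is hypothetical, and matching is plain case-insensitive substring search, so it will also flag words such as "uplift".

```python
# Hypothetical phrase lists per evidence level, loosely based on the
# "Avoid this language" column. Adjust to the team's own style rules.
FLAGGED = {
    "tracked response only": ["created", "caused", "lift", "outperformed", "proves"],
    "prior-period comparison": ["caused", "lift", "proves"],
    "matched baseline": ["caused", "proves", "incremental lift"],
    "protected holdout": ["every future campaign", "proves"],
    "model plus calibration": ["proven", "exact"],
}


def flag_overclaims(sentence, evidence_level):
    """Return the flagged phrases found in a sentence for an evidence level."""
    text = sentence.lower()
    return [phrase for phrase in FLAGGED[evidence_level] if phrase in text]


risky = flag_overclaims(
    "The campaign caused the full before-and-after increase.",
    "prior-period comparison",
)
safe = flag_overclaims(
    "Observed outcomes were higher than the selected prior period.",
    "prior-period comparison",
)
print(risky, safe)
```

An empty result does not make a sentence safe; it only means none of the listed phrases appeared.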

Common baseline traps

Choosing the weakest prior week, month, or quarter after the campaign result is known.
Comparing a high-intent campaign audience with all site visitors, all customers, or all reachable households.
Pooling placements, devices, or creative units when one side had better visibility, lower friction, or stronger destination quality.
Using attribution-window credit as if it were a comparison group.
Calling a matched market valid because it is similar in size while pre-period trends point in different directions.
Ignoring late-arriving conversions, lead disqualification, sales follow-up gaps, or data-lag cutoffs.
Treating a model baseline as objective without showing controls, calibration evidence, uncertainty, and sensitivity.

Meeting script

What is the exact baseline used in this readout?
Was the comparison chosen before results were visible?
Do treatment and comparison groups align on pre-period level, trend, mix, and tracking quality?
What changed during the campaign window that could explain the same outcome?
Which claim survives if attributed outcomes are described as observed response instead of lift?
Does the next decision need descriptive optimization, a matched baseline, a protected holdout, or a calibrated model?

Pair with

Use this checklist with the following companion guides:

Campaign readout QA checklist, when a finished report needs evidence-level language.
Measurement method selector, when the question may need a different method.
Incrementality test plan template, before a causal test is launched.
Comparison market and holdout planning guide, before selecting markets or controls.
Randomized lift test readout checklist, when assignment is controlled.
Holdout leakage and suppression QA checklist, when the control group may have been exposed.
Uncertainty interval readout checklist, when a point estimate needs a decision range.
Outcome quality scorecard, before volume becomes a value claim.
Evidence-to-claim language matrix, when the final sentence needs safer wording.
