Campaign baseline comparison checklist
A campaign result is only as useful as the comparison underneath it. Before a report says a campaign drove, lifted, improved, or outperformed anything, the reader needs to know which baseline the result is being compared against.
Use this checklist when a campaign readout, vendor deck, dashboard, or renewal memo turns observed performance into a stronger claim. The goal is to separate clean descriptive reporting from directional evidence and from causal lift language.
Baseline strength ladder
Not every decision needs a randomized test, but every decision needs a visible comparison. Start by naming the baseline type before debating the conclusion.
| Comparison type | Useful for | Main weakness | Careful claim |
|---|---|---|---|
| No explicit baseline | Confirming delivery, tracking, and observed response. | No way to separate campaign response from normal demand. | The campaign delivered measured activity under the stated tracking rule. |
| Prior-period baseline | Fast readouts when seasonality, pricing, traffic mix, and tracking are stable. | The prior period may be a convenient chart choice rather than a fair counterfactual. | Observed outcomes were higher or lower than the selected prior period. |
| Matched context or audience baseline | Directional comparisons across similar pages, packages, audiences, or placements. | Matched groups can still differ in intent, reachability, creative, or destination quality. | The result is directionally stronger than the matched comparison, subject to matching limits. |
| Matched market or geo baseline | Market-level launches where user-level holdouts are impractical. | Pre-period trend, local seasonality, and concurrent changes can explain the gap. | The treatment markets outperformed the selected comparison markets within this design. |
| Protected holdout | Estimating incremental effect when assignment and suppression can be defended. | Leakage, noncompliance, underpowered cells, or narrow samples can weaken generalization. | The campaign produced measured lift for this audience, window, and uncertainty range. |
| Model baseline calibrated with experiments | Budget planning across channels, time, and business drivers. | Model assumptions and calibration quality determine how causal the estimate deserves to sound. | Multiple evidence streams support a bounded planning range. |
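The protected-holdout row asks for a lift stated with an uncertainty range. As a minimal sketch of what that can look like, the function below computes the absolute difference in conversion rate between a treated group and a holdout, with a normal-approximation interval for a difference of two proportions. The function name, the field names, and the example counts are hypothetical, and the interval method is an assumption chosen for brevity, not a method the checklist prescribes.

```python
from math import sqrt


def holdout_lift(treated_conversions, treated_size,
                 holdout_conversions, holdout_size, z=1.96):
    """Absolute lift in conversion rate with a normal-approximation interval.

    z=1.96 corresponds to an approximate 95% interval. All names here are
    illustrative assumptions, not part of the checklist.
    """
    if treated_size <= 0 or holdout_size <= 0:
        raise ValueError("both groups need at least one member")

    treated_rate = treated_conversions / treated_size
    holdout_rate = holdout_conversions / holdout_size
    lift = treated_rate - holdout_rate

    # Standard error of a difference between two independent proportions.
    std_err = sqrt(
        treated_rate * (1 - treated_rate) / treated_size
        + holdout_rate * (1 - holdout_rate) / holdout_size
    )
    low, high = lift - z * std_err, lift + z * std_err

    return {
        "treated_rate": treated_rate,
        "holdout_rate": holdout_rate,
        "absolute_lift": lift,
        "interval_low": low,
        "interval_high": high,
        "interval_excludes_zero": low > 0 or high < 0,
    }


# Hypothetical counts: 120 of 10,000 treated users converted, 100 of 10,000 held out.
result = holdout_lift(120, 10_000, 100, 10_000)
print(result)
```

With these made-up counts the point estimate is positive but the interval spans zero, so the careful claim would describe a measured difference whose uncertainty range still includes no effect. Small or rare-event samples may need a different interval method.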
Build the comparison packet
Ask for the packet before accepting the baseline. A weak comparison often looks strong because the report hides the fields that would make it testable.
- Decision and claim: The report should state the budget, renewal, creative, package, or test decision the result is meant to inform, plus the exact sentence the evidence is supposed to support.
- Primary outcome: Name the preselected outcome, denominator, event rule, conversion window, lead-quality filter, and reporting cutoff. If the primary outcome changed after results were visible, downgrade the claim.
- Eligible population: Show who or what could enter the campaign and the comparison: users, households, markets, stores, pages, placements, devices, leads, or matched outcomes.
- Pre-period evidence: Show levels and trends before launch. A matched group is weak if it only matches on size while demand, traffic quality, or outcome mix was moving differently.
- Concurrent changes: List pricing, promotions, sales coverage, site changes, creative changes, inventory shifts, news-cycle effects, and other campaigns that changed during the readout window.
- Exposure isolation: For holdouts or matched groups, show suppression logs, audience overlap, control exposure, channel leakage, and whether excluded users could still see the campaign elsewhere.
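One way to make the packet requestable is to treat its six parts as a record and list whatever is still blank. The sketch below is only an illustration: the class name, field names, and example values are hypothetical, and a real intake form would likely carry structured data in place of free text.

```python
from dataclasses import dataclass, fields


@dataclass
class ComparisonPacket:
    """Hypothetical record with one free-text field per packet item."""
    decision_and_claim: str = ""
    primary_outcome: str = ""
    eligible_population: str = ""
    pre_period_evidence: str = ""
    concurrent_changes: str = ""
    exposure_isolation: str = ""


def missing_fields(packet):
    """Return the names of packet fields that are empty or whitespace only."""
    return [
        f.name for f in fields(packet)
        if not getattr(packet, f.name).strip()
    ]


# Example with made-up content: only two of the six items were supplied.
packet = ComparisonPacket(
    decision_and_claim="Renew the Q3 package; claim: leads rose versus matched pages.",
    primary_outcome="Qualified leads per 1,000 sessions, 14-day window.",
)
print(missing_fields(packet))
```

A non-empty result is a prompt to go back to the report owner before the baseline is debated.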
Baseline fit test
Use these checks before treating a comparison as fair enough for the decision.
| Fit check | Ask for | Downgrade when |
|---|---|---|
| Level alignment | Pre-period outcome levels by treatment and comparison group. | The baseline group starts far above or below treatment without adjustment or explanation. |
| Trend alignment | Pre-period slopes across several comparable windows. | The groups were already diverging before the campaign started. |
| Mix alignment | Device, geography, audience, placement, creative, product, and destination mix. | The treatment group received a materially easier or harder mix. |
| Opportunity alignment | Eligibility, reachability, inventory availability, frequency, and bid conditions. | The comparison group had less opportunity to be reached or convert. |
| Tracking alignment | Tag status, event rules, attribution windows, match rates, and data-lag cutoffs. | One side has better measurement coverage than the other. |
| Outcome maturity | Conversion lag, sales follow-up status, brand-study field dates, and late-arriving outcomes. | The readout compares mature outcomes with immature outcomes. |
| Decision threshold | The minimum effect, cost, quality, or uncertainty threshold that would change the decision. | The report celebrates a difference too small or too uncertain to act on. |
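The level and trend checks in the table can be approximated with a few lines of arithmetic on pre-period series. The sketch below compares mean levels and least-squares slopes for a treatment and a comparison group. The function names and the tolerance values (`level_tol`, `trend_tol`) are hypothetical placeholders, not thresholds recommended by this checklist, and the series are assumed to be equally spaced pre-launch periods with positive means.

```python
from statistics import mean


def slope(values):
    """Least-squares slope of values against their position 0..n-1."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two periods to estimate a trend")
    x_bar = (n - 1) / 2
    y_bar = mean(values)
    numerator = sum((i - x_bar) * (y - y_bar) for i, y in enumerate(values))
    denominator = sum((i - x_bar) ** 2 for i in range(n))
    return numerator / denominator


def pre_period_fit(treatment, comparison, level_tol=0.10, trend_tol=0.02):
    """Compare pre-period level and trend for two groups.

    level_tol and trend_tol are illustrative assumptions; the right values
    depend on the decision threshold for the specific readout.
    """
    t_level, c_level = mean(treatment), mean(comparison)
    if t_level <= 0 or c_level <= 0:
        raise ValueError("pre-period means must be positive")

    # Relative gap in average level, measured against the comparison group.
    level_gap = (t_level - c_level) / c_level
    # Slopes are divided by each group's own level so that trends are
    # compared as a share of level per period, not in raw units.
    trend_gap = slope(treatment) / t_level - slope(comparison) / c_level

    return {
        "level_gap": level_gap,
        "trend_gap": trend_gap,
        "level_ok": abs(level_gap) <= level_tol,
        "trend_ok": abs(trend_gap) <= trend_tol,
    }


# Made-up weekly pre-period outcomes: treatment was already climbing.
print(pre_period_fit([100, 110, 120, 130], [100, 100, 100, 100]))
```

In this made-up example both flags come back false, which matches the "already diverging before the campaign started" downgrade condition. The check covers only the first two rows of the table; mix, opportunity, tracking, and maturity still need their own evidence.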
Language by evidence level
The same table can support very different wording depending on the baseline. Keep the conclusion inside the comparison.
| Evidence in hand | Use this language | Avoid this language |
|---|---|---|
| Tracked response only | The campaign produced observed visits, leads, or matched outcomes under the stated tracking rule. | The campaign created new demand. |
| Prior-period comparison | Observed outcomes were higher than the selected prior period, with these known context changes. | The campaign caused the full before-and-after increase. |
| Matched baseline | The campaign outperformed the matched comparison on this outcome, with remaining mix and selection limits. | The matched result proves incremental lift. |
| Protected holdout with leakage checks | The campaign produced measured lift within the design, window, sample, and uncertainty range. | The result will generalize to every future campaign. |
| Model plus calibration evidence | The model and calibration evidence support this planning direction, with stated assumptions and sensitivity. | The model has proven exact channel contribution. |
Common baseline traps
- Choosing the weakest prior week, month, or quarter after the campaign result is known.
- Comparing a high-intent campaign audience with all site visitors, all customers, or all reachable households.
- Pooling placements, devices, or creative units when one side had better visibility, lower friction, or stronger destination quality.
- Using attribution-window credit as if it were a comparison group.
- Calling a matched market valid because it is similar in size while pre-period trends point in different directions.
- Ignoring late-arriving conversions, lead disqualification, sales follow-up gaps, or data-lag cutoffs.
- Treating a model baseline as objective without showing controls, calibration evidence, uncertainty, and sensitivity.
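The first trap, picking the most flattering prior period after the fact, is easier to spot when the same campaign-period value is compared against every plausible prior window at once. The sketch below does that arithmetic. The function name, the window labels, and the numbers are hypothetical and chosen only to illustrate the spread.

```python
def prior_period_sensitivity(campaign_value, prior_periods):
    """Relative change of the campaign period against each candidate baseline.

    prior_periods maps a label to that window's outcome value. Labels and
    values in the example are made up for illustration.
    """
    changes = {}
    for label, baseline in prior_periods.items():
        if baseline <= 0:
            raise ValueError(f"baseline for {label!r} must be positive")
        changes[label] = (campaign_value - baseline) / baseline

    return {
        "changes": changes,
        "smallest": min(changes.values()),
        "largest": max(changes.values()),
    }


result = prior_period_sensitivity(
    1200,
    {
        "prior_week": 1000,
        "trailing_4_week_average": 1100,
        "same_week_last_year": 1150,
    },
)
for label, change in result["changes"].items():
    print(f"{label}: {change:+.1%}")
```

If the reported increase sits at the top of this range, the readout should explain why that window was chosen, ideally with evidence that it was selected before results were visible.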
Meeting script
- What is the exact baseline used in this readout?
- Was the comparison chosen before results were visible?
- Do treatment and comparison groups align on pre-period level, trend, mix, and tracking quality?
- What changed during the campaign window that could explain the same outcome?
- Which claim survives if attributed outcomes are described as observed response instead of lift?
- Does the next decision need descriptive optimization, a matched baseline, a protected holdout, or a calibrated model?
Pair with
Use this checklist with:
- the campaign readout QA checklist, when a finished report needs evidence-level language
- the measurement method selector, when the question may need a different method
- the incrementality test plan template, before a causal test is launched
- the comparison market and holdout planning guide, before selecting markets or controls
- the randomized lift test readout checklist, when assignment is controlled
- the holdout leakage and suppression QA checklist, when the control group may have been exposed
- the uncertainty interval readout checklist, when a point estimate needs a decision range
- the outcome quality scorecard, before volume becomes a value claim
- the evidence-to-claim language matrix, when the final sentence needs safer wording