Marketing mix modeling
MMM calibration evidence checklist
Calibration is the bridge between a marketing mix model and real-world experiments. It can make a model more useful, but only when the evidence being used as an anchor actually matches the effect the model is asked to estimate.
Use this checklist before a lift test, geo experiment, holdout, brand study, or external benchmark is used to tune a model, validate a channel contribution, or defend a budget recommendation. A calibration point is not a trophy. It is evidence with a unit, population, outcome, window, uncertainty, and decision boundary.
Start with the calibration job
Teams often say an MMM is calibrated without saying what the calibration is supposed to do. Name the job before accepting the evidence.
| Calibration job | Evidence needed | Weak substitute |
|---|---|---|
| Anchor one channel's incremental effect. | A credible test for the same channel, audience, market, outcome, and time window the model estimates. | A platform lift number from a different audience, season, or outcome treated as a universal truth. |
| Constrain a response curve. | Tests or historical spend variation near the current and proposed spend range. | A single average return applied to every future budget level. |
| Set priors for a sparse channel. | Transparent prior source, relevance test, uncertainty range, and sensitivity run. | A benchmark inserted without showing how much it drives the answer. |
| Validate channel ranking. | Multiple calibration points or sensitivity checks showing whether rank survives uncertainty. | One favorable experiment used to justify a full channel order. |
| Plan the next test. | A specific model uncertainty that maps to a practical test design: audience, market, outcome, and detectable effect. | A generic plan to run a test after the budget decision has already been made. |
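The "constrain a response curve" row turns on the gap between average and marginal return. The sketch below illustrates that gap under an assumed exponential saturation curve. The curve form and the `beta` and `scale` parameters are hypothetical and not taken from any particular MMM library; a real model may use a different shape.

```python
import math


def response(spend, beta, scale):
    """Hypothetical saturating response: incremental outcome at a given spend."""
    return beta * (1.0 - math.exp(-spend / scale))


def average_return(spend, beta, scale):
    """Outcome per unit of spend, averaged over the whole budget."""
    return response(spend, beta, scale) / spend


def marginal_return(spend, beta, scale):
    """Outcome from the next unit of spend (derivative of the response curve)."""
    return (beta / scale) * math.exp(-spend / scale)


if __name__ == "__main__":
    beta, scale = 500_000.0, 100_000.0  # illustrative values only
    for spend in (50_000.0, 100_000.0, 200_000.0):
        avg = average_return(spend, beta, scale)
        mrg = marginal_return(spend, beta, scale)
        print(f"spend={spend:>9.0f}  average={avg:.2f}  marginal={mrg:.2f}")
```

On a concave curve like this one, the marginal return sits below the average return at every positive spend level, so an average measured at current spend would overstate the payoff from a budget increase. Tests or spend variation near the proposed range are what pin down the marginal figure.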
Calibration packet
Ask for this packet before debating whether the model has been validated. The point is to make the calibration evidence auditable.
- Evidence source: Test owner, method, field dates, channel, media tactic, geography, audience, creative mix, delivery quality, and whether the readout was completed before model fitting.
- Estimand match: The effect the evidence estimates and the effect the MMM estimates. Check whether both use incremental revenue, conversions, qualified leads, store visits, awareness, or another outcome.
- Unit and population: User, household, account, market, store, region, or time-series unit, plus the eligible population. A user-level test may not cleanly calibrate a market-level model without careful translation.
- Window alignment: Exposure window, outcome window, conversion lag, carryover, seasonality, and whether the MMM's adstock assumptions line up with the experiment's readout period.
- Uncertainty: Interval, minimum detectable effect, sample base, leakage risk, and whether the model uses the calibration point as a tight anchor or a broad prior.
- Influence check: Model results with and without the calibration evidence, plus sensitivity to reasonable alternative priors or calibration weights.
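The uncertainty and influence items can be made concrete with a small sensitivity run. The sketch below is a simplification: it treats the model's own estimate and the calibration point as two independent normal estimates of the same effect and combines them by precision weighting. A real MMM would refit with the prior in place. The function names and the numbers are hypothetical.

```python
import math


def precision_weighted_update(model_mean, model_sd, calib_mean, calib_sd):
    """Combine two normal estimates of the same effect by precision weighting."""
    w_model = 1.0 / model_sd ** 2
    w_calib = 1.0 / calib_sd ** 2
    post_var = 1.0 / (w_model + w_calib)
    post_mean = post_var * (w_model * model_mean + w_calib * calib_mean)
    return post_mean, math.sqrt(post_var)


def influence_report(model_mean, model_sd, calib_mean, calib_sds):
    """Show how far the answer moves as the calibration point is held tighter or looser."""
    rows = []
    for calib_sd in calib_sds:
        mean, sd = precision_weighted_update(model_mean, model_sd, calib_mean, calib_sd)
        rows.append({
            "calib_sd": calib_sd,
            "combined_mean": mean,
            "combined_sd": sd,
            "shift_from_model": mean - model_mean,
        })
    return rows


if __name__ == "__main__":
    # Illustrative values only: uncalibrated model ROI 2.0 (sd 1.0), test readout ROI 1.0.
    for row in influence_report(2.0, 1.0, 1.0, calib_sds=(0.2, 1.0, 5.0)):
        print(row)
```

If the combined estimate moves most of the way to the test readout under the tight setting and barely moves under the broad one, the readout should say which setting was used and why.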
Evidence match matrix
| Evidence type | Strong fit for MMM calibration | Downgrade when |
|---|---|---|
| Randomized conversion lift test | Assignment was protected, outcome source matches the MMM outcome, and the tested population is close to the modeled population. | Control leakage, post-assignment filtering, narrow retargeting eligibility, or a proxy outcome makes the test estimate a different effect. |
| Geo or store lift test | Treatment and control markets had similar pre-period trends, operating changes were logged, and the outcome maps to the model grain. | Markets were chosen after results were visible, local shocks were unlogged, or the test window is too short for the modeled carryover. |
| Matched-market comparison | Matching rule was locked before launch and sensitivity checks show the estimate is not driven by one convenient control. | The comparison is mainly size-matched while trend, seasonality, distribution, or competitor pressure differ. |
| Brand lift study | The model includes a brand or demand proxy and the survey design has balanced exposed and control respondents. | Survey lift is used to calibrate sales, revenue, or profit without evidence connecting the perception outcome to business impact. |
| External benchmark | The benchmark is used as a wide prior, with source, category, channel, outcome, and uncertainty visible. | It replaces internal evidence or forces a channel contribution because the model is otherwise unstable. |
Red flags
- The calibration section says "validated by lift tests" but does not name the tests, dates, audiences, outcomes, intervals, or model influence.
- A narrow platform test calibrates a broad channel that includes prospecting, retargeting, brand search, partner media, or different creative.
- A short-window experiment is used to confirm a long-run response curve without showing lag or carryover sensitivity.
- A brand metric is used to anchor sales contribution without explaining the bridge from perception to business outcome.
- The model recommendation changes materially when calibration priors are loosened, but the readout still presents one firm budget answer.
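The short-window red flag can be checked with simple arithmetic. The sketch below assumes geometric adstock and a single one-week spend pulse, which is a deliberate simplification. Under that assumption, the effect in week t is proportional to decay^t, so the share of the total effect that lands in the first n weeks is 1 - decay^n. The function names are hypothetical.

```python
def captured_fraction(decay, weeks):
    """Share of a one-week pulse's total geometric-adstock effect that lands
    within the first `weeks` weeks, counting the exposure week as week one."""
    if not 0.0 <= decay < 1.0:
        raise ValueError("decay must be in [0, 1)")
    if weeks < 1:
        raise ValueError("weeks must be at least 1")
    return 1.0 - decay ** weeks


def implied_long_run_effect(observed_effect, decay, weeks):
    """Scale a short-window readout up to the long-run effect the assumed decay implies."""
    return observed_effect / captured_fraction(decay, weeks)


if __name__ == "__main__":
    # Illustrative decay rates only; the MMM's own adstock estimate should be used in practice.
    for decay in (0.3, 0.5, 0.7, 0.9):
        share = captured_fraction(decay, weeks=4)
        print(f"decay={decay:.1f}  share captured in 4 weeks={share:.3f}")
```

Running the scaling across a plausible range of decay rates shows how much of the claimed long-run response rests on the carryover assumption and how much on the experiment itself.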
How to write the calibration note
A useful readout makes the calibration note short, specific, and bounded. It should say what evidence was used, how it entered the model, how much it changed the answer, and where it should not be generalized.
| Evidence condition | Careful wording | Wording to avoid |
|---|---|---|
| Relevant test with clean assignment and aligned outcome. | The model is calibrated to experimental evidence for this channel, population, and outcome range. | The model is proven correct. |
| Relevant but noisy test. | The test informs the prior, but uncertainty remains wide enough to limit budget-change confidence. | The test validates the channel return. |
| Different audience or outcome. | The evidence is directional because it estimates a related but different effect. | The same lift applies to the modeled channel. |
| External benchmark only. | The benchmark supplies a broad plausibility range until internal experiments are available. | The benchmark confirms expected performance. |
| Conflicting calibration points. | The model should show sensitivity and identify the next testable uncertainty. | The average of conflicting tests settles the question. |
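Before averaging calibration points, it can help to check whether they disagree by more than their stated uncertainty allows. The sketch below uses a standard heterogeneity check from meta-analysis (Cochran's Q and the I-squared statistic), assuming each test reports an estimate and a standard error for the same effect. The function name and example values are hypothetical.

```python
def heterogeneity(estimates, std_errors):
    """Inverse-variance pooled estimate, Cochran's Q, and I-squared for a set of tests."""
    if len(estimates) != len(std_errors) or len(estimates) < 2:
        raise ValueError("need at least two estimates with matching standard errors")
    weights = [1.0 / se ** 2 for se in std_errors]
    pooled = sum(w * x for w, x in zip(weights, estimates)) / sum(weights)
    q = sum(w * (x - pooled) ** 2 for w, x in zip(weights, estimates))
    df = len(estimates) - 1
    # I-squared: share of variation across tests beyond what sampling noise would explain.
    i_squared = max(0.0, (q - df) / q) if q > 0 else 0.0
    return {"pooled": pooled, "q": q, "df": df, "i_squared": i_squared}


if __name__ == "__main__":
    # Illustrative values only: two lift tests for the same channel that disagree.
    print(heterogeneity(estimates=[1.0, 3.0], std_errors=[0.5, 0.5]))
```

A Q well above its degrees of freedom, or a high I-squared, suggests the tests are estimating different effects (different audience, window, or outcome), which is a reason to show sensitivity to each point instead of reporting the pooled value as settled.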
Meeting questions
- Which calibration points were used, and which were rejected?
- Do the calibration points estimate the same outcome, population, and time window as the model?
- How does the model result change when each calibration point is removed?
- Are calibration points treated as tight constraints or broad priors?
- Which channel recommendation depends most on calibration assumptions?
- What experiment would most reduce uncertainty before the next budget decision?
Pair with
Use this checklist with the MMM causal validity checklist before accepting model contribution claims and the MMM readout QA checklist before budget allocation. When calibration evidence comes from experiments, pair it with the randomized lift test readout checklist, geo lift test design checklist, and comparison market and holdout planning guide. Use the source library for official references on MMM, outcomes, attention, and measurement quality.