Lift-test planning

Minimum detectable effect planning checklist

Published July 3, 2026. Updated July 3, 2026. Status: evergreen source page.

A lift test can be well randomized and still be unhelpful if it cannot detect the effect the decision needs. Minimum detectable effect planning keeps the test honest before the campaign, audience, outcome, and budget are locked.

Use this checklist before launching a conversion lift test, geo lift test, brand study, retail-media incrementality test, or private marketplace campaign readout. The goal is to decide whether the planned test can separate a meaningful effect from ordinary noise.

Figure: Editorial planning-board illustration showing sample size, base rate, test window, uncertainty, and action threshold checks before a lift result is used for a decision.

Caption: Minimum detectable effect planning is a pre-launch decision gate. The sample, base rate, test window, and uncertainty need to clear the business action threshold before a randomized result can become decision-grade evidence.

Start with the decision threshold

The minimum detectable effect should not be chosen only because a calculator returns it. Start with the smallest effect that would change the decision, then check whether the planned sample, base rate, and test window can detect it.

Planning field	Make visible before launch	Weak shortcut
Decision	The spend, renewal, creative, audience, market, or vendor decision the test will inform.	Running a test because a platform can provide one.
Action threshold	The minimum lift, margin, qualified lead rate, pipeline movement, brand shift, or revenue effect that would justify action.	Accepting any positive estimate as useful.
Base rate	The current outcome rate for the eligible population, with seasonality and conversion lag noted.	Planning around a category benchmark that does not match the tested audience.
Absolute effect	The percentage-point or unit difference implied by the business threshold.	Using relative lift alone, especially when the base rate is small.
Assigned unit	User, household, account, store, market, or placement group, plus the reason that unit can be separated.	Choosing the unit that is easiest to report rather than the unit that protects the counterfactual.
Available sample	Eligible units, expected delivery, reachable control size, event volume, and any matching or survey completion loss.	Counting all site traffic, customers, impressions, or markets as if all are eligible for the test.
Test window	Exposure period, outcome maturity window, lag allowance, and reporting cutoff.	Stopping when the first favorable readout appears.
Power and precision	Minimum detectable effect, planned power, interval width, and whether the test can detect the action threshold.	Calling an underpowered result negative or a noisy positive result proven.
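
As a rough illustration of the power and precision row, the sketch below estimates the absolute minimum detectable effect for a binary outcome with two equal arms and compares it with an action threshold. It assumes a two-sided z-test with a normal approximation. The function name mde_absolute and every input value are hypothetical, chosen only to show the arithmetic.

```python
from math import sqrt
from statistics import NormalDist


def mde_absolute(base_rate, n_per_arm, alpha=0.05, power=0.80):
    """Approximate absolute MDE for a binary outcome with two equal arms.

    Assumes a two-sided z-test and uses the base rate for the variance
    in both arms (a planning approximation, not an exact calculation).
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    standard_error = sqrt(2 * base_rate * (1 - base_rate) / n_per_arm)
    return (z_alpha + z_power) * standard_error


# Hypothetical planning inputs, not taken from any real campaign.
base_rate = 0.02          # 2% outcome rate in the eligible population
n_per_arm = 50_000        # eligible units per arm after expected losses
action_threshold = 0.003  # smallest absolute effect worth acting on (0.3 pp)

mde = mde_absolute(base_rate, n_per_arm)
can_detect = mde <= action_threshold
print(f"MDE: {mde * 100:.2f} percentage points")
print("Can detect the action threshold:", can_detect)
```

With these illustrative inputs the detectable effect is roughly a quarter of a percentage point, which sits below the 0.3-point threshold. A geo or store-level design would need a different variance model, so treat this as a starting check rather than a full power analysis.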

Translate the business hurdle

From relative lift to absolute lift

A 20 percent relative lift can be tiny in absolute terms when the base outcome rate is small. Translate the decision into an absolute change in conversion rate, lead rate, margin, qualified pipeline, or brand response before judging whether the test is large enough.
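
A short sketch of that translation, using hypothetical base rates: the same 20 percent relative lift is converted into percentage points and into extra outcomes per 100,000 eligible units.

```python
def absolute_lift(base_rate, relative_lift):
    """Absolute change in the outcome rate implied by a relative lift."""
    return base_rate * relative_lift


relative_lift = 0.20  # the 20 percent relative lift from the text
base_rates = (0.001, 0.02, 0.10)  # hypothetical base outcome rates

# Absolute lift for each base rate, keyed by the base rate itself.
deltas = {base: absolute_lift(base, relative_lift) for base in base_rates}

for base, delta in deltas.items():
    print(
        f"base rate {base * 100:.1f}%: "
        f"+{delta * 100:.2f} percentage points, "
        f"about {delta * 100_000:.0f} extra outcomes per 100,000 units"
    )
```

The relative lift is identical in every row, but the absolute effect the test has to detect differs by a factor of 100 between the smallest and largest base rate.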

From outcome volume to commercial value

Incremental outcomes should be tied to the metric that changes the decision. A test planned around raw leads may still be underpowered for qualified leads, pipeline, margin, or retained customers.
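
The gap between raw and qualified outcomes can be sketched with a standard two-proportion sample-size approximation. The rates, the qualified share, and the function name n_per_arm are assumptions made for illustration only.

```python
from math import ceil
from statistics import NormalDist


def n_per_arm(base_rate, relative_lift, alpha=0.05, power=0.80):
    """Approximate units per arm needed to detect a relative lift.

    Normal-approximation formula for comparing two proportions with
    equal arms and a two-sided test.
    """
    p1 = base_rate
    p2 = base_rate * (1 + relative_lift)
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil(z ** 2 * variance / (p2 - p1) ** 2)


# Hypothetical inputs: 2% raw lead rate, one in four leads is qualified.
raw_lead_rate = 0.02
qualified_share = 0.25
relative_lift = 0.10

n_raw = n_per_arm(raw_lead_rate, relative_lift)
n_qualified = n_per_arm(raw_lead_rate * qualified_share, relative_lift)

print(f"Units per arm, raw leads:       {n_raw:,}")
print(f"Units per arm, qualified leads: {n_qualified:,}")
```

Under these assumptions the qualified-lead outcome needs roughly four times the sample of the raw-lead outcome for the same relative lift, which is why a plan sized for raw leads can fall short on the metric that actually drives the decision.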

From statistical threshold to action threshold

Statistical detection is not the same as business usefulness. A test can detect a small effect that does not clear the payback hurdle, or miss a useful effect because the sample is too small.

From aggregate to segment

If the decision is about a segment, market, device, creative, or audience slice, plan power for that slice, not only for the aggregate. A well-powered aggregate test does not automatically support confident slice rankings.
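
To see how quickly precision degrades in a slice, the sketch below applies the same normal-approximation MDE to an aggregate and to two smaller segments. It assumes every slice shares the aggregate base rate, which is rarely exactly true. The slice names and sizes are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def mde_absolute(base_rate, n_per_arm, alpha=0.05, power=0.80):
    """Approximate absolute MDE, two equal arms, two-sided z-test."""
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    return z * sqrt(2 * base_rate * (1 - base_rate) / n_per_arm)


base_rate = 0.02  # hypothetical, assumed equal across slices

# Hypothetical units per arm for the aggregate and two preplanned slices.
units_per_arm = {
    "aggregate": 200_000,
    "segment A (10% of units)": 20_000,
    "segment B (2.5% of units)": 5_000,
}

mde_by_slice = {
    name: mde_absolute(base_rate, n) for name, n in units_per_arm.items()
}

for name, mde in mde_by_slice.items():
    print(f"{name}: MDE {mde * 100:.2f} percentage points")
```

Because the detectable effect scales with one over the square root of the sample, a slice holding a tenth of the units can only detect effects about three times larger than the aggregate can.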

Power-readiness gates

Gate	Pass condition	If it fails
Outcome is common enough	The planned window produces enough observed outcomes to estimate the decision metric with useful precision.	Extend the window, choose a more mature outcome, increase eligible sample, or lower the decision stakes.
Control group can be protected	Suppression, holdout, market separation, or survey control rules keep the comparison meaningful.	Use the result as operational learning, not strong causal evidence.
Effect threshold is realistic	The action threshold is large enough to matter and small enough for the test to detect under the available sample.	Do not launch a high-stakes test that can only detect effects larger than the business expects.
Lag is included	The outcome window covers the expected response delay, sales cycle, survey fielding, or conversion maturity period.	Separate early diagnostics from final outcome claims.
Segments are preplanned	Priority slices, minimum bases, and downgrade rules are written before results are visible.	Label slice findings exploratory and retest before concentrating spend.
Stop rule is locked	The team knows when the test will end and what evidence triggers scale, redesign, repeat, or no action.	Avoid reading every interim result as a final decision.

MDE readiness score

Use this score before launch to decide whether the planned test can support the decision, should be redesigned, or should be labeled as a learning readout. The score is deliberately practical: it separates business threshold, available evidence, and allowed language.

Review field	Green	Yellow	Red
Action threshold	The smallest useful effect is written as an absolute outcome, margin, pipeline, or brand-response change.	The threshold exists, but it is still stated mostly as relative lift or a soft goal.	Any positive estimate will be treated as useful.
Detectable effect	The planned MDE is at or below the action threshold with the stated power and uncertainty method.	The MDE is slightly above the threshold, so the result can guide learning but not a high-stakes action by itself.	The test can only detect an effect larger than the business would realistically expect.
Outcome volume	Eligible units and mature outcomes remain sufficient after exclusions, match loss, survey completion loss, and conversion lag.	Volume is plausible, but one loss factor or maturity window still needs a written assumption.	The plan counts all traffic, impressions, customers, or markets as if all are eligible and mature.
Control protection	Treatment and control separation is enforceable for the assigned unit and the measurement window.	Some contamination risk is named, but the readout language has a downgrade rule.	Suppression, market separation, or holdout protection is assumed rather than operationally checked.
Segment language	Only preplanned segments with minimum bases can receive decision-grade language.	Segments can be reported as directional diagnostics if they are visibly underpowered.	Slice rankings will be used to move budget even when only the aggregate is powered.

If any row is red, do not launch the test as decision-grade without repair. If two or more rows are yellow, the launch brief should name the test as a learning readout, a design rehearsal, or a bounded decision rather than a broad proof of effectiveness.
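
The rule above can be written as a small function, shown here as a sketch. The field names and return labels are illustrative. The rule does not say how to treat a single yellow row, so the sketch assumes the launch can stay decision-grade with that row noted. That choice is an assumption, not part of the checklist.

```python
VALID_COLORS = {"green", "yellow", "red"}


def launch_label(scores):
    """Map readiness scores to a launch label.

    scores: dict of review field -> "green", "yellow", or "red".
    """
    colors = list(scores.values())
    unknown = set(colors) - VALID_COLORS
    if unknown:
        raise ValueError(f"Unexpected score values: {sorted(unknown)}")
    if "red" in colors:
        return "repair before launching as decision-grade"
    if colors.count("yellow") >= 2:
        return "learning readout, design rehearsal, or bounded decision"
    # Assumption: zero or one yellow row keeps decision-grade status.
    return "decision-grade"


# Hypothetical scoring of one planned test.
example_scores = {
    "Action threshold": "green",
    "Detectable effect": "yellow",
    "Outcome volume": "yellow",
    "Control protection": "green",
    "Segment language": "green",
}
print(launch_label(example_scores))
```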

Planning worksheet

Worksheet prompt	Record before launch
What decision changes if the test clears the threshold?	Scale, renew, pause, change creative, change audience, change bid, change market mix, or run a larger test.
What is the current base rate?	Outcome rate, qualified outcome rate, or brand response rate for the eligible population and period.
What absolute effect is worth acting on?	Percentage-point change, incremental outcome count, incremental margin, qualified pipeline, or material brand movement.
What sample is actually eligible?	Assigned units, expected treatment delivery, reachable control, exclusions, matching loss, and survey completion loss.
What is the planned minimum detectable effect?	The smallest effect the design is expected to detect with the stated power and uncertainty method.
What happens if the detectable effect is larger than the action threshold?	Change design, extend duration, reduce decision stakes, or mark the readout as learning-only.
Which slices matter before results are visible?	Preplanned segments, minimum bases, interval requirements, and exploratory labels.
Which language is allowed after each result pattern?	Scale within tested bounds, directional learning, inconclusive, operational issue, or unsupported causal claim.
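
For the eligible-sample prompt, a minimal sketch of the loss arithmetic is shown below. It assumes the losses apply one after another and independently, and every rate is a hypothetical placeholder rather than a benchmark.

```python
from math import sqrt


def eligible_units(assigned_units, loss_rates):
    """Units left after applying each loss rate in sequence."""
    remaining = float(assigned_units)
    for rate in loss_rates:
        remaining *= 1 - rate
    return round(remaining)


# Hypothetical planning inputs.
assigned = 500_000
losses = {
    "exclusions": 0.10,
    "matching loss": 0.30,
    "survey completion loss": 0.0,  # not a survey-based test in this example
}

eligible = eligible_units(assigned, losses.values())

# Under a one-over-square-root-of-n approximation, the detectable effect
# grows by this factor compared with a plan that counts every assigned unit.
mde_inflation = sqrt(assigned / eligible)

print(f"Eligible units: {eligible:,} of {assigned:,}")
print(f"MDE inflation versus the naive plan: {mde_inflation:.2f}x")
```

In this illustration about 37 percent of assigned units drop out, and the detectable effect is roughly a quarter larger than a plan built on the full assigned count would suggest.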

Decision language by power pattern

Pattern	Careful interpretation	Do not say
The test is powered to detect the action threshold, assignment is clean, and the interval clears the hurdle.	The result supports the stated decision within the tested population, period, and outcome definition.	The channel will produce the same lift everywhere.
The test is underpowered for the action threshold and returns a near-zero estimate.	The result is inconclusive for the decision; it may not have been able to detect a useful effect.	The campaign had no effect.
The point estimate is positive but the interval is wider than the action threshold.	The result is directional at best and needs more sample, stronger outcome quality, or a repeat test.	The test proved lift because the estimate was positive.
The aggregate is powered but planned segments are not.	The aggregate can guide a bounded decision; segment rankings need quieter language or a dedicated test.	The top segment won.
The test detects a statistically visible effect below the commercial hurdle.	The media may have moved the outcome, but the effect may not justify the planned action.	Statistical lift means the budget should scale.
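
Four of the patterns in this table (all except the aggregate-versus-segment row) can be precommitted as a simple lookup, sketched below. The function name, the argument names, and the way "near-zero" is operationalized (an interval that includes zero) are assumptions for illustration. A real launch brief would define these terms itself.

```python
def readout_language(estimate, ci_low, ci_high, action_threshold, mde):
    """Pick cautious readout language from the interval and the plan.

    All effect arguments are absolute effects in the same unit.
    """
    powered = mde <= action_threshold
    if powered and ci_low >= action_threshold:
        return "supports the stated decision within the tested bounds"
    if ci_low > 0 and ci_high < action_threshold:
        return "effect visible but below the commercial hurdle"
    # Assumption: "near-zero" means the interval includes zero.
    if not powered and ci_low <= 0 <= ci_high:
        return "inconclusive for the decision"
    if estimate > 0 and (ci_high - ci_low) > action_threshold:
        return "directional at best"
    return "pattern not covered here; review manually"


# Hypothetical readouts, in absolute conversion-rate terms.
print(readout_language(0.0050, 0.0040, 0.0060, 0.003, 0.0025))
print(readout_language(0.0002, -0.0030, 0.0034, 0.003, 0.0050))
print(readout_language(0.0010, 0.0004, 0.0016, 0.003, 0.0010))
print(readout_language(0.0035, 0.0010, 0.0060, 0.003, 0.0030))
```

Writing the mapping down before results are visible is the point. The specific cutoffs matter less than the fact that they cannot be adjusted after the estimate is known.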

Worked downgrade example

A campaign team wants a lift test to decide whether to renew a private marketplace package. The business owner writes that renewal needs at least 300 incremental qualified leads over the campaign window. The current qualified-lead rate is low, match loss is expected, and the planned test can only detect about 700 incremental qualified leads with useful precision.

The test can still be worth running, but the launch brief should downgrade the decision language before results are visible. A positive point estimate below the 700-lead detectable effect can be described as directional learning, not renewal proof. A near-zero estimate cannot be treated as evidence that the package had no value because the design could not reliably detect the smaller effect that would still matter to the business.

The repair is not to hide the power problem. The repair is to change one of the planning inputs: extend the test, increase eligible sample, use a more common mature outcome, lower the decision stakes, or precommit that the readout will only decide whether a larger test is justified.
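
A rough sense of how large the sample-based repairs would need to be can be sketched from the 300-lead and 700-lead figures in the example. The sketch assumes that the detectable effect per unit shrinks with one over the square root of the eligible sample, and that the business hurdle per unit stays fixed as the test grows. Neither assumption is stated in the example, and the function name is hypothetical.

```python
def sample_multiplier(current_mde, target_mde):
    """Factor by which eligible sample must grow to reach the target MDE.

    Assumes MDE per unit is proportional to 1 / sqrt(sample size) and
    that both values are measured on the same scale.
    """
    return (current_mde / target_mde) ** 2


current_mde_leads = 700  # detectable effect in the example
action_threshold_leads = 300  # renewal hurdle in the example

multiplier = sample_multiplier(current_mde_leads, action_threshold_leads)
print(f"Eligible sample would need to grow about {multiplier:.1f}x")
```

Under those assumptions the eligible sample would need to be more than five times larger, which helps explain why lowering the decision stakes or precommitting to a learning-only readout is often the realistic repair.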

Questions before launch

What is the smallest effect that would change the budget, renewal, creative, or audience decision?
Is that threshold stated as an absolute effect, not only a relative lift?
How many eligible units and mature outcomes will exist after exclusions, matching loss, and survey completion loss?
Can the planned design detect the action threshold with useful precision?
Which slices are decision-grade, and which will be labeled exploratory?
What readout language is allowed if the result is positive but underpowered, precise but too small, or inconclusive?

Takeaway

Minimum detectable effect planning protects the reader from false certainty in both directions. It prevents a noisy positive estimate from becoming a confident win, and it prevents an underpowered near-zero estimate from being treated as proof that nothing worked.
