The Misuse of Geo-Holdout Tests
If you’ve ever found yourself sifting through case studies or research promising to reveal the “true” incremental ROAS of your campaigns via geo-holdout tests, you’re not alone. These tests can feel like a perfect solution for proving ad impact, but they often oversell their precision and hide subtle pitfalls.

What Are Geo-Holdout Tests?
A geo-holdout test splits your market into geographic regions and compares two groups:
- Test Group: Regions where your marketing campaign (e.g., YouTube ads) runs.
- Control (Holdout) Group: Comparable regions where you intentionally withhold that campaign.
The difference in outcomes between the two groups is then read as the campaign's incremental impact.
Why Geo-Holdout Tests Get Misused
- Regions Aren’t Twins No two regions are identical. Your test region might have a thriving economy while your control region struggles. Demographics — like age, income, or even weather — could differ too. If sales jump in the test region, is it your ads or just local conditions? It’s hard to tell.
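As a toy illustration of the "regions aren't twins" problem (all numbers below are invented), a naive test-vs-control comparison can show a healthy lift even when the ads had zero effect:

```python
# Toy illustration, all numbers invented: the test region's stronger
# local economy produces an apparent "lift" with zero true ad effect.
test_before, test_after = 100.0, 110.0   # test region: booming economy
ctrl_before, ctrl_after = 100.0, 101.0   # control region: sluggish economy

test_growth = test_after / test_before - 1   # +10% growth
ctrl_growth = ctrl_after / ctrl_before - 1   # +1% growth

# A naive comparison credits the entire gap to the campaign.
apparent_lift = test_growth - ctrl_growth
print(f"Apparent lift from ads: {apparent_lift:.0%}")
```

The entire 9-point gap here is the regions' pre-existing difference in local conditions, not the campaign.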
- Cross-Contamination Across Channels Keeping a control group truly "ad-free" for one channel is nearly impossible today. If you withhold YouTube ads in California, your other channels — like Facebook or Google Ads — might automatically compensate, targeting that audience more aggressively. This doesn’t just muddy your control group; it can skew your tracking pixels and disrupt your broader marketing strategy. More on this below.
- In-Channel Contamination Even within the channel you're testing, excluding regions from targeting introduces complications. Withhold YouTube ads in California while keeping the same daily budget, and that unused budget doesn't simply disappear: the platform may redirect it to the remaining regions, including your test group in Texas. Texas can end up with double the intended ad spend, significantly amplifying exposure there and inflating the measured effect beyond what the ads would achieve under normal conditions. This undermines the test's integrity: the test region no longer reflects normal campaign conditions, so the geo-holdout overestimates the campaign's true impact.
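The budget-redistribution effect can be sketched with toy numbers (the $1,000 daily budget and the two-region setup are assumptions, not platform specifics):

```python
# Hypothetical illustration: a fixed daily budget re-spread when one
# region is held out. Region names and the $1,000 budget are made up.
daily_budget = 1000.0
regions = ["California", "Texas"]

# Intended plan: budget split evenly across both regions.
intended = daily_budget / len(regions)            # $500 per region

# Holdout: California is excluded, but the budget stays the same,
# so the platform re-spends all of it in the remaining region(s).
remaining = [r for r in regions if r != "California"]
actual = daily_budget / len(remaining)            # $1,000 to Texas

print(f"Intended spend in Texas: ${intended:.0f}")
print(f"Actual spend in Texas:   ${actual:.0f} ({actual / intended:.0f}x)")
```

With more regions the distortion is smaller but still present: every non-holdout region quietly receives more spend than the plan called for.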
- Not Enough Data to Work With Unlike user-level A/B tests with thousands of data points, geo-tests use just a handful of regions — think states or cities. This small sample size makes it tough to detect subtle but meaningful effects, especially with a limited budget.
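A back-of-envelope calculation shows why a handful of regions struggles. The numbers here are assumptions chosen for illustration (8% region-level revenue noise, 5 geo units per arm), but the shape of the result is the point:

```python
import math

# Back-of-envelope sketch with assumed numbers: 8% region-to-region
# revenue noise (one standard deviation) and 5 geo units per arm.
region_noise_sd = 0.08
n_per_arm = 5

# Standard error of the difference between test and control arm means.
se_diff = region_noise_sd * math.sqrt(2 / n_per_arm)

# Rule of thumb: detecting a lift at alpha=0.05 with 80% power needs
# the true effect to be roughly 2.8 standard errors.
mde = 2.8 * se_diff

print(f"SE of the test-vs-control difference: {se_diff:.1%}")
print(f"Approximate minimum detectable lift:  {mde:.1%}")
```

Under these assumptions the minimum detectable lift is around 14%, so a real 5% effect would routinely go undetected, or be reported with an enormous error bar.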
- Life Happens External factors — holidays, local events, or seasonal trends — can hit one region harder than another. If your test group experiences a natural sales surge during the test, you might wrongly credit your ads.
- Reading Too Much Into It This is where I see leaders stumble most: treating geo-tests as definitive proof of a campaign’s value. They can suggest whether a channel adds value (yes or no), but they’re unreliable for pinpointing how much! Overrelying on them sets you up for risky calls.
The Margin of Error: A Real-World Wake-Up Call
Suppose your test market generates $10M in revenue, you spend $200,000 on YouTube ads, and the geo-test reports a 5% lift with a ±4% margin of error. Here's what that actually means:
- 5% Lift: That’s $500,000 extra revenue (5% of $10M) from YouTube ads.
- ±4% Margin: The true lift could be anywhere from 1% ($100,000) to 9% ($900,000).
- At 5% lift, ROAS is 2.5 ($500,000 revenue / $200,000 spend) — decent.
- At 1% lift, ROAS falls to 0.5 ($100,000 / $200,000) — you’re in the red.
- At 9% lift, ROAS climbs to 4.5 ($900,000 / $200,000) — a slam dunk.
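The arithmetic above, as a quick script (same hypothetical $10M-revenue / $200K-spend figures, not real campaign data):

```python
# Reproducing the worked example: hypothetical $10M baseline revenue
# and $200K of YouTube spend, with a 5% +/- 4% measured lift.
baseline_revenue = 10_000_000
spend = 200_000

def roas(lift: float) -> float:
    """Incremental revenue implied by a given lift, divided by spend."""
    return baseline_revenue * lift / spend

point, margin = 0.05, 0.04
low, high = point - margin, point + margin

print(f"Point estimate: lift {point:.0%} -> ROAS {roas(point):.1f}")
print(f"Lower bound:    lift {low:.0%} -> ROAS {roas(low):.1f}")
print(f"Upper bound:    lift {high:.0%} -> ROAS {roas(high):.1f}")
```

The same test is consistent with a money-losing channel (ROAS 0.5) and a star performer (ROAS 4.5), which is exactly why the point estimate alone is dangerous.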
A tempting follow-up is to turn the geo-test into a correction factor: if the test says the true lift is 5% but platform attribution credits YouTube with only 1%, multiply YouTube's attributed numbers by 5x going forward. That compounds the problems:
- Shaky Foundations: Geo-tests already carry a wide margin of error, like that ±4% above. A 5% lift could really be 1% or 9%. Multiplying YouTube's 1% by 5x based on a potentially off-base estimate can massively overstate its impact, or understate it if the true lift is lower.
- Amplifying Volatility: Platform attribution numbers, like that 1%, are often tiny and fluctuate weekly due to seasonality, promotions, or random user behavior. If one week YouTube’s attribution spikes to 3% from a fluke, a 5x coefficient would claim a 15% contribution — wildly misleading when the next week it drops back down.
- Mixing Apples and Oranges: Geo-tests measure causal lift, while platform attribution often leans on correlation. Applying a coefficient from one to the other ignores this mismatch, distorting your view of what’s really driving results (for example, lift might have been caused by completely different campaigns that didn't even have attribution-reported conversions).
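The volatility amplification is easy to sketch numerically. The weekly attribution shares below are invented for illustration:

```python
# Sketch of the fixed-coefficient trap. The weekly attribution shares
# are invented; the 5x multiplier comes from dividing a one-off 5%
# geo-test lift by a 1% platform-attributed share.
coefficient = 5.0

weekly_attribution = [0.010, 0.012, 0.030, 0.009]  # week 3 is a fluke spike

scaled = [coefficient * share for share in weekly_attribution]
for week, (raw, adj) in enumerate(zip(weekly_attribution, scaled), start=1):
    print(f"Week {week}: attributed {raw:.1%} -> claimed incremental {adj:.1%}")

# The fluke week now claims a 15% contribution: the coefficient
# amplifies attribution noise instead of measuring anything new.
```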
Consider the True Cost of Such Tests!
A proper geo-holdout isn't free: you forgo revenue in the holdout regions for the duration of the test, disrupt your other channels' targeting, and tie up budget and analyst time for weeks, all for an estimate with a wide error bar.
Key Takeaways for Top Managers
- Direction, Not Details: Use them to see if a channel is worth pursuing (yes/no), not to nail down exact returns, and only when you are genuinely uncertain whether your ads are incremental at all (e.g., TV or out-of-home ads).
- Mind the Margin: A 5% lift with ±4% error could really be 1% or 9%. Account for that uncertainty; it is far too large to plug into actual iROAS calculations.
- Stay Humble: Don't let one geo-test dictate your strategy. It's a piece of the puzzle, not the whole picture. And beware of anyone telling you this is a reliable method to measure incremental ROAS!
