AI & ML

Smarter A/B Testing: How Smaller Experiments Can Drive Greater Revenue

Apr 12, 2026 · 5 min read

The A/B Testing Assumption That's Costing Marketers Real Money

Most marketing teams run A/B tests the way they were taught in statistics class: calculate the required sample size for 80% power, hit that number, check the p-value, declare a winner. It feels rigorous. It's also, in many real-world campaigns, the wrong approach — and it's quietly draining profit from finite-audience experiments every day.

A 2019 academic paper by Elea McDonnell Feit and Ron Berman, Test & Roll: Profit-Maximizing A/B Tests, makes this case with mathematical precision. The core argument: when you're running experiments on a bounded population — a fixed email list, a capped media budget, a seasonal traffic window — the objective should be maximizing total profit across the test and rollout phases combined, not achieving a clean confidence interval.

Why Classical A/B Testing Wasn't Built for Marketing

The hypothesis-testing framework that dominates experimental design has roots in scientific research, where the priority is valid inference that generalizes across populations and time. Publishing a finding in a journal requires strict error control. You need to rule out false positives. You need replicable results.

Marketing experiments are a fundamentally different animal. You're not publishing findings — you're making a one-time deployment decision. You have 200,000 email subscribers, a campaign that runs through the end of the quarter, and a question: does version A or version B drive more revenue? The answer only needs to be right for this list, this quarter. And critically, every person you put in the test phase is a person who might get the weaker treatment before you've identified the winner.

That opportunity cost is invisible in classical power calculations. It doesn't appear in the formula. But it's real money.

The Two-Stage Framework — and What It Reveals

Feit and Berman formalize what most practitioners already do informally: test on a subset, then roll out the winner to the rest. They call this "test and roll," and they derive the mathematically optimal split between those two stages.

The math produces something counterintuitive. Profit-maximizing test sizes grow at roughly the square root of your total population size, while classical power-based sample sizes don't depend on the population at all: they're driven entirely by the effect size you want to detect and the noise in your metric. In advertising, where lift is frequently in the low single digits and variance is high, classical methods can recommend test groups so large they consume most of the viable campaign audience. You end up spending the entire budget learning, then rolling out to almost no one.

Under the profit-maximizing framework, the math recognizes this trade-off explicitly. A smaller test that correctly identifies the winner 70% of the time might generate more total profit than a larger test that's right 90% of the time, if those extra 20 points of certainty came at the cost of thousands of users receiving the inferior treatment during the test phase.

The "Underpowered" Label Needs Rethinking

For anyone who has sat in a marketing analytics review and heard "this test was underpowered," the Feit-Berman framework supplies the question worth asking in response: underpowered for what?

Underpowered for publication-grade inference? Almost certainly. Underpowered for a profitable business decision on a finite audience? That requires a completely different calculation — one that accounts for the size of your rollout population, your prior beliefs about likely effect sizes, and the variance in your outcome metric. A test that looks underpowered through a statistical lens may be exactly the right size through a profit lens.

This isn't a license for sloppy experimentation. It's a more honest accounting of what the test is actually for.

Unequal Splits Aren't Always Bad Design

One of the more practically useful findings in the paper concerns asymmetric test splits. Standard A/B testing convention pushes toward 50/50 splits because they maximize statistical efficiency per observation. But when your prior belief is that one treatment is meaningfully more likely to win — say, you're testing a proven creative against an unproven challenger, or running a holdout against an established CRM treatment — asymmetric splits can be the mathematically optimal choice.

This vindicates something practitioners in catalog marketing and CRM have done for years: running small holdout groups rather than equal splits. Critics often flag these as underpowered. The paper shows they can be exactly right, provided the asymmetry in test cell sizes reflects genuine asymmetry in prior beliefs about outcomes.

The practical implication is significant for any team managing a control group in an ongoing program. If you've been running a 90/10 split and defending it awkwardly in methodology reviews, you now have rigorous theoretical backing — as long as your priors justify it.
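
The holdout intuition can be sanity-checked the same way. In this rough Monte Carlo (all numbers invented, and the decision rule is a standard normal-normal posterior-mean comparison rather than anything quoted from the paper), a proven treatment with a higher, tighter prior makes the 90/10 split out-earn the 50/50 split on the same population.

```python
import math
import random

def posterior_mean(mu, sigma, ybar, s, n):
    # Normal-normal conjugate update: shrink the observed test-cell
    # mean toward the prior mean by the relative data precision.
    w = (n / s ** 2) / (1 / sigma ** 2 + n / s ** 2)
    return w * ybar + (1 - w) * mu

def holdout_profit(n_treat, n_hold, N, mu_t, sig_t, mu_h, sig_h, s,
                   sims=2000, seed=1):
    # Expected profit of an (n_treat, n_hold) test split, rolling out
    # whichever arm ends with the higher posterior mean.
    rng = random.Random(seed)
    total = 0.0
    for _ in range(sims):
        m_t = rng.gauss(mu_t, sig_t)   # true mean, proven treatment
        m_h = rng.gauss(mu_h, sig_h)   # true mean, holdout/challenger
        yb_t = rng.gauss(m_t, s / math.sqrt(n_treat))
        yb_h = rng.gauss(m_h, s / math.sqrt(n_hold))
        pm_t = posterior_mean(mu_t, sig_t, yb_t, s, n_treat)
        pm_h = posterior_mean(mu_h, sig_h, yb_h, s, n_hold)
        winner = m_t if pm_t >= pm_h else m_h
        total += n_treat * m_t + n_hold * m_h + (N - n_treat - n_hold) * winner
    return total / sims

# Proven treatment: prior mean 0.07, tight (sd 0.01).
# Challenger: prior mean 0.05, diffuse (sd 0.03). All illustrative.
args = (200_000, 0.07, 0.01, 0.05, 0.03, 0.10)
print(holdout_profit(9_000, 1_000, *args))   # 90/10 split
print(holdout_profit(5_000, 5_000, *args))   # 50/50 split
```

The 90/10 split wins here for exactly the reason the paper identifies: test users parked on the arm you already believe is better keep earning at that arm's higher rate while you learn about the challenger.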

How This Compares to Bandit Algorithms

Multi-armed bandit algorithms — particularly Thompson sampling — are the natural comparison point for anyone thinking about adaptive experimentation. Bandits continuously reallocate traffic toward better-performing variants during the test itself, rather than committing to fixed test sizes upfront. In pure optimization terms, they generally outperform fixed-horizon testing.

The Feit-Berman paper benchmarks against Thompson sampling and finds the gap in expected profit is often modest. That's a meaningful result for practitioners, because the operational complexity of bandits is anything but modest. Continuous adaptation requires real-time data pipelines, creates governance headaches, and is genuinely difficult to explain to stakeholders who need to approve test designs or interpret results. A clean two-stage "test then roll" process — especially one grounded in profit-maximizing sample size calculations — is far more deployable across most organizations.
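
Even a toy Thompson sampling loop makes the operational contrast visible: every single impression needs a fresh posterior draw and an immediate outcome write-back, which is precisely the real-time machinery many marketing stacks lack. A generic Bernoulli sketch, not the paper's benchmark implementation:

```python
import random

def thompson_arm(successes, failures, rng):
    # Draw once from each arm's Beta posterior (uniform Beta(1, 1)
    # priors) and serve the arm with the highest draw.
    draws = [rng.betavariate(s + 1, f + 1)
             for s, f in zip(successes, failures)]
    return max(range(len(draws)), key=draws.__getitem__)

# Toy campaign; the true conversion rates are hidden from the
# algorithm and are purely hypothetical numbers.
rng = random.Random(42)
true_rates = [0.050, 0.055]
wins, losses = [0, 0], [0, 0]
for _ in range(50_000):          # one posterior draw per impression
    arm = thompson_arm(wins, losses, rng)
    if rng.random() < true_rates[arm]:
        wins[arm] += 1
    else:
        losses[arm] += 1
print(wins, losses)  # traffic drifts toward the better-performing arm
```

Ten lines of logic, but in production every iteration of that loop is a low-latency read-update cycle against live campaign data. The two-stage alternative needs one batch computation before the test and one after it.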

For teams not yet running bandits, this paper offers a middle path: keep the simple two-stage structure, but replace the statistical power calculation with a profit-maximizing one. The lift in outcomes relative to classical methods is likely larger than the lift from moving to bandits.

Putting It Into Practice

Implementing this framework doesn't require exotic tooling. The authors provide a Shiny app that runs the calculations directly. The required inputs are all things practitioners should already have: total reachable population for the campaign, historical estimates of response variance, and prior beliefs about likely treatment effects (drawn from past similar experiments).

The workflow shift matters as much as the math. Instead of asking "how large a sample do we need to reach significance?" the question becomes "how many test users maximizes expected profit across this campaign?" That reframing changes what gets reported to decision-makers, too. Expected profit at risk — the cost of being wrong — is a metric executives understand intuitively. P-values are not.

One underappreciated step in the framework: pre-committing the rollout decision rule before the test runs. The decision criterion should be posterior expected profit, not a significance threshold. Pre-commitment prevents post-hoc rationalization of results and keeps the experiment honest to its stated objective.
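
In code, the pre-committed rule is small, which is part of its appeal. A sketch under a normal-normal conjugate model; the numbers and helper names are illustrative, not from the paper:

```python
def posterior_mean(prior_mu, prior_sigma, ybar, s, n):
    # Normal-normal conjugate update: precision-weighted average
    # of the prior mean and the observed test-cell mean.
    prec_prior = 1.0 / prior_sigma ** 2
    prec_data = n / s ** 2
    return (prec_prior * prior_mu + prec_data * ybar) / (prec_prior + prec_data)

def rollout_choice(arms):
    # Pre-committed rule: deploy the arm with the highest posterior
    # expected response. No significance threshold anywhere.
    means = [posterior_mean(*arm) for arm in arms]
    return max(range(len(means)), key=means.__getitem__)

# Arm A: observed mean 0.052 on 1,000 users; arm B: 0.049.
# Identical priors (mean 0.05, sd 0.02); response noise sd 0.10.
arms = [(0.05, 0.02, 0.052, 0.10, 1_000),
        (0.05, 0.02, 0.049, 0.10, 1_000)]
print(rollout_choice(arms))  # 0: deploy A, "significant" or not
```

Writing `rollout_choice` down before the test runs is the pre-commitment. The observed 0.3-point gap would fail any conventional significance test at these sample sizes, yet A is still the profit-maximizing deployment.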

The Broader Shift in Experimentation Culture

The Feit-Berman framework is part of a wider methodological movement pushing back against cargo-cult significance testing in applied settings. Bayesian decision theory has long offered the theoretical foundation for treating experiments as inputs to decisions rather than ends in themselves. What papers like Test & Roll add is the applied scaffolding: specific formulas, validated on real data, for domains like email, display advertising, and direct mail where practitioners actually need them.

For data science and analytics teams embedded in marketing organizations, this is a meaningful upgrade to the standard toolkit. The statistical significance framework was borrowed from a context where it made sense. The profit-maximizing framework was built for the context where these teams actually operate.

The next time a stakeholder asks why the test group is smaller than expected, "because we optimized for profit, not p-values" is now a defensible, rigorous answer — not a rationalization.