Claim
Run a permanent 5% holdout on every shipped change (or a 50/50 split on new-user cohorts) and revisit one to three years later. Roughly 30–40% of changes that produced a clear short-term lift will show no incremental long-term value. Most growth measurement misses this.
Mechanism
Short-term lifts are often pull-forward effects — accelerating signups that would have happened anyway — or low-quality user attraction that does not durable retain. Long-term holdouts catch this by separating "moved the metric" from "created incremental value." Without the holdout, the team takes credit for evaporated wins and re-invests behind theatre.
Conditions
Holds when:
- The product has cohort-tracking infrastructure that can sustain a 5% holdout for years.
- Leadership accepts the slower feedback loop (3, 6, 9, 12-month review cadences).
- The team has the discipline to act on negative findings rather than rationalise them.
Fails when:
- The product is too small or too new to have meaningful long-term cohorts.
- Decision-making cadence requires same-quarter wins; long-term findings arrive too late.
- The "win" is genuinely a one-time pull-forward (e.g., a launch event) where short-term is the right horizon.
Evidence
"We want to lower barriers to get started and help folks grow, and those winners make the whole thing work."
Finding: 30–40% of experiments showing short-term lift (paying signup, first sale) show NO incremental GMV lift after 1 year.
Archie has never seen short-term positive flip to long-term negative. So safe to ship — but don't take credit.
Operating discipline at Shopify: 5% always-on holdout, automated email reports at 3 / 6 / 9 / 12 months. "You can't hide from reality."
— Archie Abrams on Lenny's Podcast, 2026-04-28
Signals
- Quarterly growth dashboards diverge from one-year cohort GMV / ARPU dashboards.
- Initiatives that win the quarter look smaller in the annual review; budgets reallocate accordingly.
- Teams that previously over-claimed wins re-calibrate downward without political damage when the long-term data is automated and shared.
Counter-evidence
For very early-stage products, you don't have the volume or runway to run multi-year holdouts; short-term metrics are all you have. And the 30–40% number is Shopify's, not a universal constant — sample-size and category dependent. Some operators argue holdouts at scale leak revenue Shopify can afford to leak; smaller orgs cannot.
Cross-references
- Optimise for absolute count of users reaching each stage, not stage conversion rates — same operator, complementary measurement frame