Claim
The right benchmark for "is this AI system transformative?" is not a generic capability test but the Economic Turing Test: contract an agent for one to three months on a specific job; if you would hire it back, having believed it was a person, it has passed for that role. Aggregate to a money-weighted basket of jobs and call 50% the threshold for transformative AI.
Mechanism
Generic AGI debates are unfalsifiable. The Economic Turing Test grounds the question in revealed-preference: would a real buyer pay a real wage for the agent's output, blind to its origin? The answer is per-role and per-workstream, which makes it actionable for product teams. It also re-orders the conversation from "is the model smart enough?" to "for which specific workstream does this agent already pass?" — a question with concrete answers and concrete economic stakes.
Conditions
Holds when:
- The role has a market-clearing wage for human labor (a contractor benchmark exists).
- The work product is evaluable by the buyer over a meaningful time window.
- The buyer is willing to evaluate honestly rather than pattern-match on origin.
Fails when:
- The role is so novel there is no contractor benchmark.
- Evaluation is cheap to game (one-shot tasks the agent can fake at the surface but not maintain).
- The buyer's bias against AI overrides their own commercial judgment.
Evidence
"If you contract an agent for a month or three months on a particular job, if you decide to hire that agent and it turns out to be a machine rather than a person, then it's passed the Economic Turing Test for that role."
Reference points Mann anchors against: Fin/Intercom resolves 82% of customer-service tickets fully automated; the remaining 18% is genuinely harder. Anthropic's own engineering reports 95% of Claude Code's code is written by Claude with 10–20x output multiplier.
— Benjamin Mann on Lenny's Podcast, 2026-04-28
Signals
- Specific workstreams cross the threshold and become "agent-default" with humans on review.
- Money-weighted percentage of agent-passable workstreams climbs over quarters.
- Pricing conversations shift from per-seat to per-outcome (ties into Bret Taylor's outcomes-based pricing thesis).
Counter-evidence
The test isolates economic substitution but ignores trust, legal exposure, and failure-mode novelty. An agent might pass the test for ninety days and then produce a catastrophic compounding error a human would not. Aggregation to a 50% threshold for "transformative AI" is provocative but arbitrary; reasonable operators can disagree about the right denominator and weighting.
Cross-references
- Use new tools as new tools, not as old tools — be ambitious and retry from scratch — the operator-side complement
- Give the model tools and a goal; do not hard-code the workflow — the architectural prerequisite for passing the test at scale