a builder's codex

November 2025 was the qualitative threshold — coding agents now almost always do what you tell them

By Simon Willison · Co-creator of Django; creator of Datasette; independent AI/engineering writer · 2026-04-02 · podcast · Simon Willison on agentic engineering and the November 2025 inflection — Lenny's Podcast

Tier A · TL;DR
Between GPT-5.1 and Claude Opus 4.5, agent reliability crossed from "review everything" to "review exceptions" — the point where AI-native workflows stop being aspirational.

Claim

Between GPT-5.1 and Claude Opus 4.5, coding agents crossed a discrete threshold from "mostly works" to "almost always works." That single shift, dated to November 2025, is the line separating "AI-native" as aspiration from "AI-native" as operating reality.

Mechanism

Reasoning models gave coding agents the ability to think through problems before producing output. Combined with longer context and stronger tool use, the failure rate per task dropped below the point at which humans stop verifying every step. Below that reliability threshold, the human reviews everything and the time savings are marginal; above it, the human reviews only the exceptions and the savings compound. This is a phase change, not a slope change.
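The review-economics argument can be made concrete with a toy model (the numbers and threshold are assumptions for illustration, not from the source): human cost per task falls gradually as the agent's failure rate falls, then drops discontinuously the moment trust flips from "review everything" to "review exceptions."

```python
# Toy model (all numbers assumed): human minutes per task when
# supervising an agent with per-task failure rate p.
REVIEW_MIN = 5.0   # minutes to review one task (assumed)
FIX_MIN = 15.0     # minutes to fix one failed task (assumed)
TRUST_P = 0.10     # failure rate below which the human stops reviewing everything

def human_minutes_per_task(p: float) -> float:
    if p >= TRUST_P:
        # Low trust: review every task, then fix the failures.
        return REVIEW_MIN + p * FIX_MIN
    # High trust: only the failing tasks consume human time.
    return p * (REVIEW_MIN + FIX_MIN)
```

At p just above 0.10 the cost is about 6.5 minutes per task; just below, about 2.0 — the jump at the trust boundary is the "phase change, not a slope change."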

Conditions

Holds when: the task's output is mechanically verifiable (code is the bellwether because code has unit tests).

Fails when: the domain lacks a verification layer (essays, lawsuits, plans), so the trust threshold has not yet arrived there.

Evidence

"Previously coding agents mostly worked, now they almost always do what you told them to do."

Simon reports that ~95% of his own code is no longer typed by him personally. StrongDM has been practicing "nobody reads the code" since August 2025, for security software no less, gated by a simulated-QA swarm that runs thousands of agent-driven test users inside a vibe-coded simulation of Slack/Jira/Okta. At ~$10K/day in tokens, the setup is robust enough to ship security-critical infrastructure.
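The StrongDM pipeline is not specified beyond what's described above; a minimal sketch of the gating pattern it implies, releasing only when a swarm of simulated user sessions stays within a failure budget, might look like this (all names, rates, and defaults are hypothetical):

```python
import random

def simulated_user_session(seed: int) -> bool:
    """Stand-in for one agent-driven test user exercising the product
    (StrongDM's real swarm runs inside a simulated Slack/Jira/Okta).
    Returns True if the session completed without errors."""
    rng = random.Random(seed)
    return rng.random() > 0.01  # hypothetical 1% per-session failure rate

def release_gate(n_sessions: int = 1000, failure_budget: int = 5) -> bool:
    """Ship the build only if swarm failures stay within budget."""
    failures = sum(not simulated_user_session(i) for i in range(n_sessions))
    return failures <= failure_budget
```

The design point is that the gate, not a human reader, carries the verification burden — which is what makes "nobody reads the code" survivable.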

— Simon Willison on Lenny's Podcast, 2026-04-02

Signals

Counter-evidence

The threshold is task-specific. Code is the bellwether because code is verifiable; essays, lawsuits, and plans don't have a unit test. Operators who extrapolate "November happened for code" to "November happened for marketing" will overpay for current-model marketing agents and underbuild eval infrastructure. The right move is to invest in the eval/verification layer for each domain so that domain's threshold can arrive.
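The "invest in the eval/verification layer per domain" advice reduces to a pattern like the following sketch; the marketing-copy checks are invented for illustration, not from the source:

```python
from typing import Callable

Check = Callable[[str], bool]

def domain_score(output: str, checks: list[Check]) -> float:
    """Fraction of domain-specific checks an agent's output passes.
    For code, the checks already exist (unit tests); for other domains
    they have to be built before that domain's threshold can arrive."""
    if not checks:
        return 0.0
    return sum(check(output) for check in checks) / len(checks)

# Hypothetical checks for a marketing-copy domain:
marketing_checks: list[Check] = [
    lambda s: len(s.split()) <= 60,          # stays within a length budget
    lambda s: "guarantee" not in s.lower(),  # avoids a banned-claims list
]
```

A domain with no checks scores zero by construction, which is the operator's signal that eval infrastructure, not a bigger model, is the missing investment.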

Cross-references
