a builder's codex

Evals are systematic data analysis on your LLM application — start with error analysis, not tests

By Hamel Husain · Independent ML consultant (ex-GitHub, Airbnb) and Berkeley PhD researcher · 2026-04-28 · podcast · Evals as error analysis, the benevolent dictator, LLM judges

Tier A · TL;DR
Evals are systematic data analysis on your LLM application — start with error analysis, not tests

Claim

LLM evals are not a new mystical discipline; they are systematic data analysis on stochastic application output. The trap is to jump straight to writing tests. Start with open-ended error analysis on real traces, identify the failure modes that survive prompt fixes, then build narrow binary judges only for those.
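A "narrow binary judge" for one surviving failure mode can be as small as a prompt template plus a strict verdict parser. This is a minimal sketch, not the course's implementation: `call_model` is a hypothetical placeholder for whatever provider client you use, and the failure mode named in the prompt is illustrative.

```python
# Minimal sketch of a narrow binary LLM judge for ONE failure mode.
# call_model() is a hypothetical stand-in for a real provider call.
JUDGE_PROMPT = """You are grading one output of our support bot.
Failure mode under test: the reply invents a URL that was not in the context.
Answer with exactly one word: PASS or FAIL.

Context: {context}
Reply: {reply}
"""

def call_model(prompt: str) -> str:
    # Placeholder: wire this to your model provider's client.
    raise NotImplementedError

def parse_verdict(raw: str) -> bool:
    """Map the judge's free-text answer to a strict binary verdict."""
    word = raw.strip().split()[0].upper().rstrip(".")
    if word not in {"PASS", "FAIL"}:
        raise ValueError(f"unparseable judge output: {raw!r}")
    return word == "PASS"

def judge(context: str, reply: str) -> bool:
    """True if the reply passes the single failure-mode check."""
    prompt = JUDGE_PROMPT.format(context=context, reply=reply)
    return parse_verdict(call_model(prompt))
```

Keeping the verdict binary (and raising on anything else) is what makes the judge's outputs easy to aggregate and audit against human labels.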

Mechanism

Tests written before error analysis test the wrong hypotheses — the team's pre-existing model of what could go wrong, not what is actually going wrong. A trace-first approach surfaces the real failure distribution: most issues are fixed by prompt edits and never need a permanent eval. Writing evals before that triage means you spend engineering time hardening against problems that no longer exist while the live failures go uncaught.

Conditions

Holds when:

Failure modes are not known upfront and must be discovered from real traces; most observed issues are resolved by prompt or pipeline fixes, so only a few persistent failure modes warrant permanent judges.

Fails when:

The domain is safety-critical or regulated and certain checks (PII, toxicity, jailbreaks) must exist before any production traffic; that class of evals is covered under Counter-evidence below.

Evidence

"It really doesn't have to be scary or unapproachable. It really is, at its core, data analytics on your LLM application."

"There's a common trap that a lot of people fall into because they jump straight to the test like, 'Let me write some tests,' and usually that's not what you want to do. You should start with some kind of data analysis to ground what you should even test."

Hamel and Shreya teach the highest-grossing course on Maven (~2,000 PMs/eng across 500 companies, including OpenAI and Anthropic).

— Hamel Husain & Shreya Shankar on Lenny's Podcast, 2026-04-28

Signals

Counter-evidence

For safety-critical or regulated domains, certain checks (PII, toxicity, jailbreaks) must be in place before any production traffic — there is no "wait and see." That is a separate class of evals, not the failure-mode evals this insight addresses.
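Such pre-production checks are typically deterministic guards rather than LLM judges. A simplified illustration of a PII gate (the regex patterns are toy examples, nowhere near a complete detector):

```python
import re

# Illustrative pre-production guardrail: flag obvious PII before a reply
# ships, independent of any failure-mode evals. Patterns are simplified
# examples for two common PII types, not an exhaustive detector.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def pii_hits(text: str) -> list[str]:
    """Return the names of PII patterns found in the text."""
    return [name for name, pat in PII_PATTERNS.items() if pat.search(text)]

print(pii_hits("Contact me at jane@example.com"))  # ['email']
```

Because these guards must hold from day one, they are written from requirements, not from observed traces, which is exactly why they sit outside the error-analysis loop described above.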

Cross-references
