Bio
Independent machine-learning consultant. Former engineer at GitHub and Airbnb. Co-teaches Maven's highest-grossing course, which treats LLM evaluation as a practical discipline and has reached ~2,000 PMs and engineers across 500 companies, including OpenAI and Anthropic. Public reference voice for "evals as error analysis" and the open-coding → axial-coding → LLM-as-judge pipeline.
Operating themes
- Evals are data analysis on LLM apps. Strip the mystique; treat them as systematic measurement.
- Start with error analysis, not tests. Sample traces, write notes, then build judges.
- Binary judges, validated against humans. No Likert scales; always inspect the off-diagonal of the confusion matrix (sketch after this list).
- Application-specific evals over generic ones. Cosine similarity and a generic "hallucination score" don't work.
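To make the judge-validation theme concrete, here is a minimal Python sketch, not Husain's implementation: tally a 2x2 confusion matrix between human and judge verdicts on the same traces. It assumes True means a trace passes the check; the verdict lists are invented placeholders, not data from the course or podcast.

```python
from collections import Counter

def confusion(human, judge):
    """Tally (human verdict, judge verdict) pairs over the same traces.
    The off-diagonal cells are the disagreements worth reading by hand."""
    cells = Counter(zip(human, judge))
    agree = cells[(True, True)] + cells[(False, False)]
    judge_only = cells[(False, True)]  # judge passed what the human failed
    human_only = cells[(True, False)]  # judge failed what the human passed
    return agree, judge_only, human_only

# Placeholder verdicts for six sampled traces.
human = [True, True, False, False, True, False]
judge = [True, False, False, True, True, False]
agree, judge_only, human_only = confusion(human, judge)
print(f"agree={agree}  off-diagonal: judge-only-pass={judge_only}, human-only-pass={human_only}")
```

The two off-diagonal counts are the review queue for judge disagreements.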
Cards
- Evals are systematic data analysis on your LLM application — start with error analysis, not tests — Treat evals as data analysis; start with error analysis [Tier A]
- Sample 100+ traces, write one free-form note per trace, let an LLM cluster the notes — humans first, machines second — Humans open-code traces; LLMs cluster the notes (sketch after these cards) [Tier A]
- Build LLM-as-judge as binary true/false, one judge per pesky failure mode — and validate against human labels — Binary judges, validated against human labels with a confusion matrix [Tier A]
- Appoint one trusted-taste expert as the eval benevolent dictator — committees stall the loop — Appoint one trusted taste arbiter for the rubric [Tier B]
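The first two cards amount to a two-step loop worth sketching. A minimal Python sketch under assumptions: `llm_complete` is a hypothetical stand-in for any chat-completion call, traces are assumed to be dicts with "id" and "text" keys, and the `input()` prompt is illustrative only.

```python
import random

def open_code(traces, n=100):
    """Human step: read each of n sampled traces and write one
    free-form note about what went wrong (or right)."""
    sample = random.sample(traces, min(n, len(traces)))
    notes = []
    for trace in sample:
        print(trace["text"])
        notes.append(input(f"Note for trace {trace['id']}: "))
    return notes

def axial_code(notes, llm_complete):
    """Machine step: ask an LLM to cluster the human notes into a
    short list of named, countable failure modes."""
    prompt = (
        "Cluster these error-analysis notes into recurring failure modes.\n"
        "Return one per line as: <mode name>: <note count>: <summary>\n\n"
        + "\n".join(f"- {note}" for note in notes)
    )
    return llm_complete(prompt)
```

The split mirrors the "humans first, machines second" card: the notes stay human-written, and the LLM only clusters them.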
Sources captured
- 2026-04-28 — Lenny's Podcast (with Shreya Shankar), "Evals as error analysis, the benevolent dictator, LLM judges" (raw/podcasts/hamel-husain-shreya-shankar--evals-error-analysis--2026-04-28.md)
External
- hamel.dev