Bio
Independent machine-learning consultant. Former engineer at GitHub and Airbnb. Co-teaches Maven's highest-grossing course, which treats LLM evaluation as a practical discipline and has reached ~2,000 PMs and engineers across 500 companies, including OpenAI and Anthropic. Public reference voice for "evals as error analysis" and the open-coding → axial-coding → LLM-as-judge pipeline.
Operating themes
- Evals are data analysis on LLM apps. Strip the mystique; treat them as systematic measurement.
- Start with error analysis, not tests. Sample traces, write notes, then build judges.
- Binary judges, validated against humans. No Likert scales; always inspect the off-diagonal of the confusion matrix (sketch after this list).
- Application-specific evals over generic ones. Cosine similarity and a generic "hallucination score" don't work.
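To make the judge-validation theme concrete, here is a minimal Python sketch, not Husain's implementation: tally a 2x2 confusion matrix between human and judge verdicts on the same traces. It assumes True means a trace passes the check; the verdict lists are invented placeholders, not data from the course or podcast.

```python
from collections import Counter

def confusion(human, judge):
    """Tally (human verdict, judge verdict) pairs over the same traces.
    The off-diagonal cells are the disagreements worth reading by hand."""
    cells = Counter(zip(human, judge))
    agree = cells[(True, True)] + cells[(False, False)]
    judge_only = cells[(False, True)]  # judge passed what the human failed
    human_only = cells[(True, False)]  # judge failed what the human passed
    return agree, judge_only, human_only

# Placeholder verdicts for six sampled traces.
human = [True, True, False, False, True, False]
judge = [True, False, False, True, True, False]
agree, judge_only, human_only = confusion(human, judge)
print(f"agree={agree}  off-diagonal: judge-only-pass={judge_only}, human-only-pass={human_only}")
```

The two off-diagonal counts are the review queue for judge disagreements.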
Cards
- Evals are systematic data analysis on your LLM application — start with error analysis, not tests — Treat evals as data analysis; start with error analysis [Tier A]
- Sample 100+ traces, write one free-form note per trace, let an LLM cluster the notes — humans first, machines second — Humans open-code traces; LLMs cluster the notes (sketch after these cards) [Tier A]
- Build LLM-as-judge as binary true/false, one judge per pesky failure mode — and validate against human labels — Binary judges, validated against human labels with a confusion matrix [Tier A]
- Appoint one trusted-taste expert as the eval benevolent dictator — committees stall the loop — Appoint one trusted taste arbiter for the rubric [Tier B]
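The first two cards amount to a two-step loop worth sketching. A minimal Python sketch under assumptions: `llm_complete` is a hypothetical stand-in for any chat-completion call, traces are assumed to be dicts with "id" and "text" keys, and the `input()` prompt is illustrative only.

```python
import random

def open_code(traces, n=100):
    """Human step: read each of n sampled traces and write one
    free-form note about what went wrong (or right)."""
    sample = random.sample(traces, min(n, len(traces)))
    notes = []
    for trace in sample:
        print(trace["text"])
        notes.append(input(f"Note for trace {trace['id']}: "))
    return notes

def axial_code(notes, llm_complete):
    """Machine step: ask an LLM to cluster the human notes into a
    short list of named, countable failure modes."""
    prompt = (
        "Cluster these error-analysis notes into recurring failure modes.\n"
        "Return one per line as: <mode name>: <note count>: <summary>\n\n"
        + "\n".join(f"- {note}" for note in notes)
    )
    return llm_complete(prompt)
```

The split mirrors the "humans first, machines second" card: the notes stay human-written, and the LLM only clusters them.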
Sources captured
- 2026-04-28 — Lenny's Podcast (with Shreya Shankar), "Evals as error analysis, the benevolent dictator, LLM judges" (raw/podcasts/hamel-husain-shreya-shankar--evals-error-analysis--2026-04-28.md)
External
- hamel.dev