Claim
LLM-as-judge evals should output a binary pass/fail per failure mode, not a Likert scale. Build 4–7 narrow judges in total, not dozens, because most failures are fixed by prompt edits and never need a permanent eval. Always validate the judge against human-labelled data using a confusion matrix, not a single agreement number.
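A minimal sketch of what "narrow and binary" means in practice. The failure-mode names and prompt wording here are illustrative, not from the source; the point is that each judge answers exactly one question with exactly two outcomes, and anything ambiguous fails closed.

```python
# Hypothetical judge registry: one narrow, binary question per failure mode.
# Names and wording are illustrative assumptions, not from the source.
FAILURE_MODES = {
    "kill_list_word": "Does the output contain any forbidden term? Answer PASS or FAIL only.",
    "unsupported_claim": "Does the output state a fact absent from the context? Answer PASS or FAIL only.",
}

def parse_verdict(raw: str) -> bool:
    """Map the judge model's raw text to a binary verdict.

    Anything other than PASS/FAIL raises, so ambiguous outputs
    surface as errors instead of silently counting either way.
    """
    verdict = raw.strip().upper()
    if verdict not in {"PASS", "FAIL"}:
        raise ValueError(f"non-binary judge output: {raw!r}")
    return verdict == "PASS"
```

A Likert-style judge has no equivalent of `parse_verdict`: there is no single threshold you can defend, so there is no falsifiable accuracy number to report.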
Mechanism
A 1-to-5 Likert scale is "a weasel way of not making a decision": it produces averages that look reasonable while masking the cases where the judge is wrong. A binary judge forces a clear pass/fail decision and yields a falsifiable accuracy number. A confusion matrix surfaces the off-diagonal cells, where a judge that says "pass" 90% of the time can hide near-total failure on the long tail. Without the matrix, teams learn to trust scaffolding that carries no real signal.
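The off-diagonal effect can be shown with a small sketch. The counts below are invented for illustration: a judge that almost always says "pass" scores 92% agreement with human labels, yet the `(human=False, judge=True)` cell reveals it missed 8 of the 10 true failures.

```python
from collections import Counter

def confusion(human: list[bool], judge: list[bool]) -> Counter:
    """Confusion matrix keyed (human_label, judge_label).

    Diagonal cells (True, True) / (False, False) are agreement;
    off-diagonal cells are the disagreements a single agreement
    percentage hides.
    """
    return Counter(zip(human, judge))

# Illustrative labels, not from the source: 90 genuine passes, 10 genuine
# failures; the judge passes everything except 2 of the failures.
human = [True] * 90 + [False] * 10
judge = [True] * 90 + [True] * 8 + [False] * 2

cm = confusion(human, judge)
agreement = (cm[(True, True)] + cm[(False, False)]) / len(human)
missed_failures = cm[(False, True)]  # human said fail, judge said pass
```

Here `agreement` is 0.92, a number a dashboard would happily report, while `missed_failures` is 8 out of 10: near-total failure on the long tail.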
Conditions
Holds when:
- Each failure mode is genuinely binary (the output either has the kill-list word or it doesn't).
- A reviewer can produce labelled data for the same traces the judge sees.
- The team will retire judges as failure modes get fixed by prompts.
Fails when:
- The failure mode is genuinely graded (severity, urgency, fluency) and binary collapses real distinctions.
- Reviewers cannot produce ground-truth labels at sufficient volume.
- The team treats judges as permanent fixtures and never prunes.
Evidence
"1-2-3-4-5 is a weasel way of not making a decision."
"When people lose trust in your evals, they lose trust in you."
Operating rule Hamel and Shreya teach: most products end up with 4–7 LLM judges in total. Always look at the off-diagonal cells in the confusion matrix; a single agreement percentage is misleading on long-tail errors.
— Hamel Husain & Shreya Shankar on Lenny's Podcast, 2026-04-28
Signals
- Judge prompts are short and binary; their accuracy is reported with confusion-matrix detail, not single-number summaries.
- Number of permanent judges stays small even as the product grows.
- Trust in the eval dashboard rises because failures map to recognisable categories.
Counter-evidence
For ranking problems (which of these outputs is best?), pairwise comparison or graded scoring may be needed. The binary rule is conditional on detection-style judges, not preference-style ones.
Cross-references
- Sample 100+ traces, write one free-form note per trace, let an LLM cluster the notes (humans first, machines second); the clustered categories become the judges
- Evals are systematic data analysis on your LLM application — start with error analysis, not tests — the broader frame